One of the greatest challenges of multimodal data is summarizing the information from multiple modalities (or views) so that complementary information is combined while redundant parts are filtered out. Due to the heterogeneity of the data, several challenges naturally arise, including different kinds of noise, alignment of the modalities (or views), and techniques to handle missing data. We study multimodal representations using two broad approaches: Joint and Coordinated.
Joint Representations involve projecting all the modalities into a common space while preserving information from each modality. Data from all modalities is required at both training and inference time, which can make handling missing data difficult. In our study, we propose a recurrent model which fuses the different views of a modality at each time-step and finally uses the joint representation to complete the task at hand (e.g., classification or regression).
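The recurrent fusion idea above can be sketched as follows. This is a minimal illustrative example, not the exact model from our study: it assumes two hypothetical, time-aligned modalities (audio and video features), concatenates them at each time-step, and runs a vanilla RNN cell whose final hidden state serves as the joint representation for a downstream task.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_fuse(audio_seq, video_seq, W_h, W_x, b):
    """Fuse two aligned modality sequences into one joint representation."""
    h = np.zeros(W_h.shape[0])
    for a_t, v_t in zip(audio_seq, video_seq):
        x_t = np.concatenate([a_t, v_t])      # fuse modalities at each step
        h = np.tanh(W_h @ h + W_x @ x_t + b)  # vanilla RNN update
    return h                                  # joint representation

# Hypothetical dimensions and random features, for illustration only.
T, d_audio, d_video, d_hidden = 5, 8, 10, 16
audio = rng.standard_normal((T, d_audio))
video = rng.standard_normal((T, d_video))
W_h = rng.standard_normal((d_hidden, d_hidden)) * 0.1
W_x = rng.standard_normal((d_hidden, d_audio + d_video)) * 0.1
b = np.zeros(d_hidden)

joint = rnn_fuse(audio, video, W_h, W_x, b)
print(joint.shape)  # (16,)
```

Note that because both modalities enter the same update, a missing modality at any time-step would have to be imputed or masked, which is exactly the difficulty with joint representations mentioned above.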
Coordinated Representations involve projecting each modality into its own space, with those spaces coordinated through a constraint. This kind of approach is better suited to modalities that are fundamentally very different and might not work well in a joint space. Given the variety of modalities in nature, Coordinated Representations hold a significant advantage over Joint Representations, which gives us reason to believe that coordination through constraints is a promising direction for multimodal representation learning.
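A coordination constraint can be sketched as follows. This is an assumed, simplified setup: two hypothetical modalities (text and image features) keep their own linear projections, and a cosine-similarity term is the constraint that ties the two spaces together during training.

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical dimensions; each modality gets its own projection matrix.
d_text, d_image, d_proj = 12, 20, 8
W_text = rng.standard_normal((d_proj, d_text)) * 0.1
W_image = rng.standard_normal((d_proj, d_image)) * 0.1

text_feat = rng.standard_normal(d_text)
image_feat = rng.standard_normal(d_image)

z_text = W_text @ text_feat    # text stays in its own projected space
z_image = W_image @ image_feat # image stays in its own projected space

# Coordination constraint: for a paired sample, minimize 1 - cos(z_text,
# z_image). Training would backpropagate this loss through W_text and
# W_image; here we only evaluate the constraint.
loss = 1.0 - cosine(z_text, z_image)
print(loss)
```

Because each modality retains its own projection, a missing modality affects only its own branch, which is one reason coordinated approaches can be more flexible than joint ones.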
H. Wang, A. Meghawat, L.-P. Morency and E. Xing. Select-Additive Learning: Improving Generalization in Multimodal Sentiment Analysis. In Proceedings of the IEEE International Conference on Multimedia & Expo (ICME), 2017
S. Ghosh, E. Laksana, L.-P. Morency and S. Scherer. Representation Learning for Speech Emotion Recognition. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 2016
S. Ghosh, E. Laksana, L.-P. Morency and S. Scherer. Learning Representations of Affect from Speech. In Proceedings of the International Conference on Learning Representations Workshop (ICLR-W), 2016
M. Worsley, L.-P. Morency, S. Scherer and P. Blikstein. Exploring Behavior Representation for Learning Analytics. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI), 2015
S. Park, P. Shoemark and L.-P. Morency. Toward Crowdsourcing Micro-Level Behavior Annotations: The Challenges of Interface, Training, and Generalization. In Proceedings of the International Conference on Intelligent User Interfaces (IUI), 2014
C. Miller, F. Quek and L.-P. Morency. Search Strategies for Pattern Identification in Multimodal Data: Three Case Studies. In Proceedings of the International Conference on Multimedia Retrieval (ICMR), 2014
Y. Song, L.-P. Morency and R. Davis. Distribution-Sensitive Learning for Imbalanced Datasets. In Proceedings of the IEEE Conference on Automatic Face and Gesture Recognition (FG), 2013
Y. Song, L.-P. Morency and R. Davis. Learning a Sparse Codebook of Facial and Body Expressions for Audio-Visual Emotion Recognition. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI), 2013
S. Scherer, S. Marsella, G. Stratou, Y. Xu, F. Morbini, A. Egan, A. Rizzo, and L.-P. Morency. Perception Markup Language: Towards a Standardized Representation of Perceived Nonverbal Behaviors. In Proceedings of the Conference on Intelligent Virtual Agents (IVA), 2012
Y. Song, L.-P. Morency and R. Davis. Multimodal Human Behavior Analysis: Learning Correlation and Interaction Across Modalities. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI), 2012
L.-P. Morency, I. de Kok and J. Gratch. Context-based Recognition during Human Interactions: Automatic Feature Selection and Encoding Dictionary. In Proceedings of the 10th International Conference on Multimodal Interfaces (ICMI), 2008 **Best paper award**