H. Wang, A. Meghawat, L.-P. Morency and E. Xing. Select-Additive Learning: Improving Generalization in Multimodal Sentiment Analysis. In Proceedings of the IEEE International Conference on Multimedia & Expo (ICME), 2017
Summarization is widely explored in natural language processing, where the aim is to create a summary that retains the most important points of the original document. In multimodal summarization, we learn both a hierarchical feature representation that captures high-level concepts and the interaction between text and video content.
Hierarchical Sequence Summarization: Building on the success of hierarchical feature representations in various computer vision tasks, we build a hierarchy dynamically and recursively by alternating sequence learning and sequence summarization, and we apply it to action recognition.
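The alternation between sequence learning and sequence summarization can be illustrated with a minimal sketch. It is not the lab's model: the `learn` step is a fixed element-wise tanh standing in for a trained sequence model, the `summarize` step merges adjacent vectors by cosine similarity, and all function names, the threshold, and the level limit are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def learn(sequence):
    """Stand-in for sequence learning (assumption): a fixed element-wise tanh."""
    return [[math.tanh(x) for x in vector] for vector in sequence]


def summarize(sequence, threshold):
    """Merge each run of adjacent, similar vectors into its mean."""
    groups = [[sequence[0]]]
    for vector in sequence[1:]:
        if cosine(groups[-1][-1], vector) >= threshold:
            groups[-1].append(vector)
        else:
            groups.append([vector])
    return [mean_vector(group) for group in groups]


def build_hierarchy(sequence, threshold=0.9, max_levels=5):
    """Alternate learn and summarize until the sequence stops shrinking."""
    levels = [sequence]
    while len(levels) < max_levels and len(levels[-1]) > 1:
        shorter = summarize(learn(levels[-1]), threshold)
        if len(shorter) == len(levels[-1]):
            break  # nothing was merged, so the hierarchy is complete
        levels.append(shorter)
    return levels


# Four toy "frames": two pointing mostly along x, then two mostly along y.
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
hierarchy = build_hierarchy(frames)
print([len(level) for level in hierarchy])
```

Each level is a shorter sequence than the one below it, so the top of the hierarchy describes the input at a coarser temporal scale.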
Summarization for TV Script: Recap sequences at the beginning of TV shows help the audience absorb the essence of previous episodes and grab their attention with upcoming plots. We study TV recap summarization, which differs from traditional text summarization in that we expect the summary to capture the duality of summarization and plot contingency between adjacent episodes.
One of the greatest challenges of multimodal data is to summarize the information from multiple modalities (or views) so that complementary information is combined while the redundant parts of the modalities are filtered out. Because the data is heterogeneous, several challenges arise naturally, including different kinds of noise, alignment of modalities (or views), and techniques to handle missing data. We study multimodal representations using two broad approaches: Joint and Coordinated.
Joint Representations project all the modalities into a common space while preserving information from each of them. Data from all modalities is required at both training and inference time, which can make missing data hard to handle. In our study, we propose a recurrent model that fuses different views of a modality at each time step and then uses the resulting joint representation for the task at hand (such as classification or regression).
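As a rough illustration of fusing views at each time step, the sketch below concatenates the views observed at time t and feeds them to a plain recurrent update. It is an assumption-laden toy rather than the proposed model: the weights are random instead of learned, the update is a simple tanh recurrence, and `fuse_sequence`, `make_matrix`, and `matvec` are hypothetical names.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix (stand-in for learned parameters)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def fuse_sequence(views, hidden_size, seed=0):
    """Fuse several time-aligned views into one joint vector.

    views: one sequence per view; each sequence is a list of feature
    vectors, and every sequence must cover the same time steps.
    """
    n_steps = len(views[0])
    if any(len(view) != n_steps for view in views):
        # A joint model needs every view at every step.
        raise ValueError("all views must cover the same time steps")

    rng = random.Random(seed)
    input_size = sum(len(view[0]) for view in views)
    w_in = make_matrix(hidden_size, input_size, rng)
    w_rec = make_matrix(hidden_size, hidden_size, rng)

    hidden = [0.0] * hidden_size
    for t in range(n_steps):
        # Fusion step: concatenate what every view observed at time t.
        joint_input = [x for view in views for x in view[t]]
        from_input = matvec(w_in, joint_input)
        from_state = matvec(w_rec, hidden)
        hidden = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
    return hidden  # joint representation for a downstream classifier/regressor


# Two toy views with three time steps each (2-d and 3-d features).
view_a = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.4]]
view_b = [[1.0, 0.5, 0.2], [0.9, 0.4, 0.1], [0.8, 0.6, 0.3]]
joint = fuse_sequence([view_a, view_b], hidden_size=4)
print(len(joint))
```

The length check makes the limitation noted above concrete: if one view is missing a time step, this kind of model has nothing to concatenate and cannot proceed without imputation.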
Coordinated Representations project each modality into its own space, and those spaces are coordinated through a constraint. This approach is more useful for modalities that are fundamentally very different and might not work well in a joint space. Given the variety of modalities in nature, Coordinated Representations have a considerable advantage over Joint Representations, which gives us reason to believe that coordination through constraints is the way forward in multimodal representation.
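One common way to express such a constraint is a similarity term between the separately projected modalities. The sketch below assumes a cosine-similarity constraint and hand-written projection matrices; the text does not say which constraint is used, so `coordination_loss`, the matrices, and the feature sizes are illustrative assumptions only.

```python
import math


def project(matrix, vector):
    """Project one modality into its own space (matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def coordination_loss(text_vec, video_vec, w_text, w_video):
    """Assumed constraint: 1 - cosine similarity of the two projections.

    The loss is 0 when paired projections point the same way and 2 when
    they point in opposite directions.
    """
    z_text = project(w_text, text_vec)
    z_video = project(w_video, video_vec)
    return 1.0 - cosine(z_text, z_video)


# Hypothetical 3-d text features and 4-d video features, each with its
# own projection into a 2-d space.
w_text = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
w_video = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
text_vec = [0.2, 0.8, 0.1]
video_vec = [0.3, 0.1, 0.9, 0.7]
print(coordination_loss(text_vec, video_vec, w_text, w_video))
```

Because each modality keeps its own projection, `project(w_text, text_vec)` can be computed even when the video features are unavailable, which is one reason this family of approaches copes more easily with missing modalities.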
Human communicative language is inherently multimodal. We as humans use a combination of language, gesture, and voice to convey our intentions. Thus, there are three modalities present in human multimodal language: language, vision, and acoustics. Multimodal sentiment analysis extends current language-based sentiment analysis (mostly applied to written text and tweets) to a multimodal setup. Similarly, emotions can be inferred from multimodal configurations based on cues in language, gesture, and voice. Differences in communicative traits can be mapped to different speaker characteristics, including persuasiveness and presentation skills. All of these forms of analysis can be performed by observing the communicative behavior of a speaker. Sentiment analysis, emotion recognition, and speaker trait recognition can be done at the video, utterance, or sentence level.
The MultiComp lab has developed multiple datasets over the years to enable studies of multimodal sentiment analysis, emotion recognition, and speaker trait recognition. CMU-MOSI2 is the largest dataset for multimodal sentiment analysis and emotion recognition at the sentence level. This dataset will be publicly available in February 2018. CMU-MOSI is a dataset for opinion-level sentiment intensity analysis. EMO-REACT is a child emotion recognition dataset. POM is a dataset for multimodal sentiment analysis and speaker trait recognition. ICT-MMMO is a dataset for video-level multimodal sentiment analysis. YouTube is a dataset for utterance-level multimodal sentiment analysis. MOUD is a dataset for utterance-level multimodal sentiment analysis in Spanish.