
Human communication is inherently multimodal: we use a combination of language, gesture, and voice to convey our intentions. Human multimodal language therefore involves three modalities: language (words), vision (gesture), and acoustics (voice). Multimodal sentiment analysis extends current language-based sentiment analysis (mostly applied to written text and tweets) to a multimodal setup. Similarly, emotions can be inferred from cues in language, gesture, and voice. Differences in communicative traits can be mapped to speaker characteristics, including persuasiveness and presentation skills. All of these forms of analysis are performed by observing the communicative behavior of a speaker. Sentiment analysis, emotion recognition, and speaker trait recognition can be done at the video level, utterance level, or sentence level.
The MultiComp lab has developed multiple datasets over the years to support research on multimodal sentiment analysis, emotion recognition, and speaker trait recognition. CMU-MOSI2 is the largest dataset for multimodal sentiment analysis and emotion recognition at the sentence level. This dataset will be publicly available in February 2018. CMU-MOSI is a dataset for opinion-level sentiment intensity analysis. EMO-REACT is a child emotion recognition dataset. POM is a dataset for multimodal sentiment analysis and speaker trait recognition. ICT-MMMO is a dataset for video-level multimodal sentiment analysis. YouTube is a dataset for utterance-level multimodal sentiment analysis. MOUD is a dataset for utterance-level multimodal sentiment analysis in Spanish.
S. Ghosh, E. Laksana, L.-P. Morency and S. Scherer. Representation Learning for Speech Emotion Recognition. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 2016
S. Ghosh, E. Laksana, L.-P. Morency and S. Scherer. Learning Representations of Affect from Speech. In Proceedings of the International Conference on Learning Representations Workshop (ICLR-W), 2016
L.-P. Morency. The Role of Context in Affective Behavior Understanding. Social Emotions in Nature and Artifact: Emotions in Human and Human-Computer Interaction, Jonathan Gratch and Stacy Marsella, Editors, Oxford University Press, 2014
J.-C. Levesque, C. Gagne and L.-P. Morency. Sequential Emotion Recognition using Latent-Dynamic Conditional Neural Fields. In Proceedings of the IEEE Conference on Automatic Face and Gesture Recognition (FG), 2013
Y. Song, L.-P. Morency and R. Davis. Learning a Sparse Codebook of Facial and Body Expressions for Audio-Visual Emotion Recognition. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI), 2013
D. Ozkan, S. Scherer and L.-P. Morency. Step-wise Emotion Recognition using Concatenated-HMM. In Proceedings of the 2nd International Audio/Visual Emotion Challenge and Workshop (AVEC), in conjunction with International Conference on Multimodal Interfaces (ICMI), Santa Monica, October, 2012