J.-C. Levesque, C. Gagne and L.-P. Morency. Sequential Emotion Recognition using Latent-Dynamic Conditional Neural Fields. In Proceedings of the IEEE Conference on Automatic Face and Gesture Recognition (FG), 2013
Human communication is inherently multimodal: we use a combination of language, gesture, and voice to convey our intentions. Human multimodal language therefore spans three modalities: language, vision, and acoustics. Multimodal sentiment analysis extends language-based sentiment analysis (mostly applied to written text and tweets) to a multimodal setup. Similarly, emotions can be inferred from multimodal cues in language, gesture, and voice. Differences in communicative traits can be mapped to speaker characteristics such as persuasiveness and presentation skills. All of these forms of analysis rely on observing the communicative behavior of a speaker. Sentiment analysis, emotion recognition, and speaker trait recognition can be performed at the video, utterance, or sentence level.
The MultiComp lab has developed multiple datasets over the years to enable studies of multimodal sentiment analysis, emotion recognition, and speaker trait recognition. CMU-MOSI2 is the largest dataset for multimodal sentiment analysis and emotion recognition at the sentence level. This dataset will be publicly available in February 2018. CMU-MOSI is a dataset for opinion-level sentiment intensity analysis. EMO-REACT is a child emotion recognition dataset. POM is a dataset for multimodal sentiment analysis and speaker trait recognition. ICT-MMMO is a dataset for video-level multimodal sentiment analysis. YouTube is a dataset for utterance-level multimodal sentiment analysis. MOUD is a dataset for utterance-level multimodal sentiment analysis in Spanish.
Typical techniques for sequence modeling rely on well-segmented sequences that have been edited to remove noisy or irrelevant parts. As a result, such methods cannot easily be applied to the noisy sequences expected in real-world applications.
In one of our projects, we study sequence modeling by combining RNNs, which capture temporal dependencies, with an attention mechanism that localizes the salient observations relevant to the final decision and ignores the irrelevant (noisy) parts of the input sequence.
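The combination of a recurrent encoder and attention can be illustrated with a minimal sketch. It is not the project's actual model: the Elman-style recurrence, the dot-product scoring vector `v`, the random weights, and the function names `rnn_states` and `attention_pool` are all assumptions chosen for illustration. The RNN produces one hidden state per time step, and the attention step turns per-step scores into softmax weights so that salient steps dominate the pooled summary while low-scoring (noisy) steps contribute little.

```python
import math
import random


def rnn_states(xs, W_x, W_h):
    """Run a simple Elman RNN: h_t = tanh(W_x x_t + W_h h_{t-1})."""
    hidden_dim = len(W_h)
    h = [0.0] * hidden_dim
    states = []
    for x in xs:
        h = [
            math.tanh(
                sum(W_x[i][j] * x[j] for j in range(len(x)))
                + sum(W_h[i][k] * h[k] for k in range(hidden_dim))
            )
            for i in range(hidden_dim)
        ]
        states.append(h)
    return states


def attention_pool(states, v):
    """Score each time step by a dot product with v, then softmax-pool."""
    scores = [sum(v_i * h_i for v_i, h_i in zip(v, h)) for h in states]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of hidden states: the summary used for the final decision.
    context = [
        sum(w * h[i] for w, h in zip(weights, states))
        for i in range(len(v))
    ]
    return weights, context


def random_matrix(rows, cols, rng):
    """Hypothetical helper: untrained random weights for the demo."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
input_dim, hidden_dim, steps = 3, 4, 6
W_x = random_matrix(hidden_dim, input_dim, rng)
W_h = random_matrix(hidden_dim, hidden_dim, rng)
v = [rng.uniform(-1.0, 1.0) for _ in range(hidden_dim)]
xs = random_matrix(steps, input_dim, rng)  # a toy input sequence

states = rnn_states(xs, W_x, W_h)
weights, context = attention_pool(states, v)
print([round(w, 3) for w in weights])
```

In a trained model, `W_x`, `W_h`, and `v` would be learned jointly, and the attention weights could be inspected to see which segments of an unsegmented sequence drove the prediction.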
More recent work uses more powerful neural network models, such as Transformers, to process longer sequences. One of our recent projects uses a hierarchical architecture to model sequences at multiple temporal resolutions, which yields representations at different degrees of granularity and makes long-range dependencies easier to capture. This work is being applied to music generation, where modeling hierarchical structure is especially important.
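The idea of multiple temporal resolutions can be sketched as follows. The text does not specify how the hierarchy is built, so this example substitutes plain average pooling over adjacent frames as a stand-in; the names `coarsen` and `temporal_pyramid`, the pooling factor, and the number of levels are all hypothetical. Each level halves the sequence length, so a position at a coarse level summarizes a longer span of the original sequence.

```python
def coarsen(sequence, factor=2):
    """Average each run of `factor` consecutive frames into one frame."""
    pooled = []
    for start in range(0, len(sequence), factor):
        chunk = sequence[start:start + factor]  # last chunk may be shorter
        dim = len(chunk[0])
        pooled.append(
            [sum(frame[d] for frame in chunk) / len(chunk) for d in range(dim)]
        )
    return pooled


def temporal_pyramid(sequence, levels=3, factor=2):
    """Return the sequence at `levels` resolutions, finest first."""
    pyramid = [sequence]
    for _ in range(levels - 1):
        if len(pyramid[-1]) <= 1:
            break  # cannot coarsen a single frame any further
        pyramid.append(coarsen(pyramid[-1], factor))
    return pyramid


# Toy sequence of 8 one-dimensional frames (e.g. note-level features).
frames = [[float(t)] for t in range(1, 9)]
pyramid = temporal_pyramid(frames, levels=3, factor=2)
for level in pyramid:
    print(len(level), level)
```

A learned model would typically replace the fixed averaging with trainable downsampling and process each level with its own sequence layers; the sketch only shows why a dependency that spans many frames at the finest level spans just a few at a coarser one.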