Representation Learning for Speech Emotion Recognition

Publications

One of the greatest challenges of multimodal data is to summarize the information from multiple modalities (or views) in a way that complementary information is used as a conglomerate while filtering out the redundant parts of the modalities. Due to the heterogeneity of the data, some challenges naturally spring up including different kinds of noise, alignment of modalities (or views) and, techniques to handle missing data. We study multimodal representations using two broad approaches, Joint and Coordinated.

Joint Representations involve projecting all the modalities to a common space while preserving information from the given modalities. Data from all modalities is required at training and inference time which can potentially make dealing with missing data hard. In our study, we propose a recurrent model which can fuse different views of a modality at each time-step and finally use the joint representation to complete the task at hand (like classification, regression, etc.).

Coordinated Representations involve projecting all the modalities to their space, but those spaces are coordinated using a constraint. This kind of an approach is more useful for modalities which are fundamentally very different and might not work well in a joint space. Due to the variety of modalities in nature, Coordinated Representations have a huge advantage over Joint Representations which gives us reason to believe that the coordination using constraints is the way to go in the field of multimodal representation.

Human communicative language is inherently multimodal. We as humans use a combination of language, gesture, and voice to convey our intentions. Thus there are three modalities present in human multimodal language: language, vision, and acoustics. Multimodal sentiment analysis is an extension of the current language-based sentiment analysis (mostly applied on written text and tweets) to a multimodal setup. Similarly, emotions can be inferred from multimodal configurations based on cues in language, gesture, and voice. Differences in communicative traits can be mapped to different speaker characteristics including persuasiveness and presentation skills. All of these various forms of analysis can be performed by observing the communicative behavior of a speaker. Sentiment analysis, emotion recognition, and speaker trait recognition can be done at video level, utterance level or sentence level.

The MultiComp lab has developed multiple datasets over the years to enable the studies targeting multimodal sentiment analysis, emotion recognition, and speaker traits recognition. CMU-MOSI2 is the largest dataset of multimodal sentiment analysis and emotion recognition at the sentence level. This dataset will be publicly available in February 2018. CMU-MOSI is a dataset of opinion-level sentiment intensity analysis. EMO-REACT is a child emotion recognition dataset. POM is a dataset of multimodal sentiment analysis and speaker traits recognition. ICT-MMMO is a dataset of video-level multimodal sentiment analysis. YouTube is a dataset of utterance-level multimodal sentiment analysis. MOUD is a dataset of utterance-level multimodal sentiment analysis in Spanish.