S. Ghosh, E. Laksana, L.-P. Morency and S. Scherer, Representation Learning for Speech Emotion Recognition, In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 2016
One of the greatest challenges of multimodal data is summarizing the information from multiple modalities (or views) so that complementary information is combined while redundant information is filtered out. The heterogeneity of the data raises further challenges, including different kinds of noise in each modality, alignment of modalities (or views), and techniques for handling missing data. We study multimodal representations using two broad approaches: Joint and Coordinated.
Joint Representations project all modalities into a common space while preserving information from each of them. Data from all modalities is required at both training and inference time, which can make missing data hard to handle. In our study, we propose a recurrent model that fuses different views of a modality at each time step and then uses the resulting joint representation for the task at hand (classification, regression, etc.).
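To make the idea of per-time-step fusion concrete, here is a minimal sketch in plain Python. It is not the model proposed in our study: it assumes the simplest possible design, in which the views are concatenated at each time step and fed to a basic tanh recurrent cell, and the final hidden state serves as the joint representation. The class name `RecurrentFusion`, the weight names, and the toy dimensions are all hypothetical, and the weights are random rather than trained.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class RecurrentFusion:
    """Hypothetical fusion cell: concatenate views, then apply a tanh RNN step."""

    def __init__(self, view_dims, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.W_in = random_matrix(hidden_dim, sum(view_dims), rng)
        self.W_rec = random_matrix(hidden_dim, hidden_dim, rng)

    def step(self, h, views):
        # Fuse the views for this time step by concatenation.
        x = [value for view in views for value in view]
        a = matvec(self.W_in, x)
        b = matvec(self.W_rec, h)
        return [math.tanh(ai + bi) for ai, bi in zip(a, b)]

    def encode(self, sequence):
        # Start from a zero state and fold in every time step.
        h = [0.0] * self.hidden_dim
        for views in sequence:
            h = self.step(h, views)
        return h  # joint representation of the whole sequence


# Two time steps, each with a 3-d view and a 2-d view (toy values).
model = RecurrentFusion(view_dims=[3, 2], hidden_dim=4)
sequence = [
    ([0.1, 0.2, 0.3], [0.5, -0.5]),
    ([0.0, 0.1, 0.0], [0.2, 0.2]),
]
joint = model.encode(sequence)
```

The vector `joint` would then be passed to a classifier or regressor. Note that `step` needs every view at every time step, which is exactly why missing data is awkward for joint representations.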
Coordinated Representations project each modality into its own space, and those spaces are coordinated through a constraint. This approach is more useful for modalities that are fundamentally very different and might not work well in a joint space. Given the variety of modalities found in nature, we believe this gives Coordinated Representations a substantial advantage over Joint Representations, and that coordination through constraints is a promising direction for multimodal representation learning.
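As an illustration of what "coordinated using a constraint" can mean, the sketch below projects two modalities with separate linear maps and measures how well the projected pairs agree. The constraint chosen here, a mean cosine-distance penalty over paired samples, is only one assumed option among many (correlation-based or ranking-based constraints are other possibilities). The function names, the projection matrices `W_audio` and `W_video`, and the toy data are hypothetical.

```python
import math


def project(W, x):
    """Apply a modality-specific linear projection (matrix as list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as uncorrelated
    return dot / (norm_u * norm_v)


def coordination_loss(W_a, W_b, pairs):
    """Mean cosine distance between the two projections of each paired sample.

    Each modality keeps its own projection and its own space; the loss is
    the constraint that pulls paired projections toward the same direction.
    """
    total = 0.0
    for x_a, x_b in pairs:
        z_a = project(W_a, x_a)
        z_b = project(W_b, x_b)
        total += 1.0 - cosine_similarity(z_a, z_b)
    return total / len(pairs)


# Toy setup: a 3-d "audio" input and a 2-d "video" input, both mapped to 2-d.
W_audio = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_video = [[1.0, 0.0], [0.0, 1.0]]
pairs = [
    ([1.0, 0.0, 5.0], [2.0, 0.0]),   # projections point the same way
    ([0.0, 1.0, 0.0], [0.0, -3.0]),  # projections point opposite ways
]
loss = coordination_loss(W_audio, W_video, pairs)
```

In training, a loss like this would be minimized alongside any task loss. Because each modality has its own projection, one modality can still be embedded when the other is absent at inference time.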
Psychologists believe that facial expressions and verbal messages are among the primary channels of human communication. In recent years, automatic emotion recognition has received considerable attention. Emotion recognition technology has developed rapidly, but it still requires further research.
Early research focused mostly on emotion analysis from single static facial images under constrained conditions. Recognition in the real world is quite different. Because human emotions are dynamic and unfold over time, research is shifting toward recognition from video or image sequences. In our work, we develop multimodal machine learning methods for static and temporal emotion and affect recognition.