Grand Challenge and Workshop on Human Multimodal Language – ACL 2018
Partially sponsored by the National Science Foundation.
July 20th, 2018, Melbourne, Australia


[6/14/2018] The proceedings of the grand challenge are available. Please click here to download.

[4/17/2018] Deadline for submitting papers has been extended to April 27th.

[4/1/2018] Submission portal now open:

[4/1/2018] We are now accepting predictions on the test data. Once we receive them, we will email you the results. Please submit your results to Zhun Liu and CC Amir Zadeh.

[3/9/2018] Grand challenge testing data is now released. Please check the CMU Multimodal SDK GitHub.

[1/18/2018] Grand challenge training data as well as the CMU Multimodal Data SDK are now released. Please check the GitHub.

[1/05/2018] Workshop page is published



Title: Scaling up Sentiment and Emotion Analysis

Speaker: Bing Liu – University of Illinois at Chicago (UIC)

Abstract: Although sentiment and emotion analysis has been extensively researched and widely used in practice, the accuracy of the current solutions is still an issue. It is not too hard to achieve good accuracy in applications that involve only one or two domains (e.g., one category of products or services), but it is a major challenge to scale up the system to work accurately for thousands (e.g., BestBuy product categories) or tens of thousands of domains (e.g., Amazon product categories). This talk will examine some specific challenges (several of which have not been well studied) and possible solutions. Leveraging multimodal data is a promising direction in applications with such data. Another promising solution is lifelong learning (LL), which aims to enable the machine to learn like humans, i.e., learning continuously, retaining the learned knowledge and using/adapting the knowledge to help learn new tasks/domains. I will identify several characteristics of sentiment analysis that make it very suitable for LL and discuss how LL may be applied to subtasks of sentiment analysis based on textual data or multimodal data. Along with it, several other relevant techniques will also be covered.


Bing Liu is a distinguished professor of Computer Science at the University of Illinois at Chicago. He received his Ph.D. in Artificial Intelligence from the University of Edinburgh. His research interests include sentiment analysis, lifelong learning, natural language processing (NLP), data mining, and machine learning. He has published extensively in top conferences and journals. Two of his papers have received 10-year Test-of-Time awards from KDD. He also authored four books: two on sentiment analysis, one on lifelong learning, and one on Web mining. Some of his work has been widely reported in the press, including a front-page article in the New York Times. On professional services, he served as the Chair of ACM SIGKDD (ACM Special Interest Group on Knowledge Discovery and Data Mining) from 2013-2017. He also served as program chair of many leading data mining conferences, including KDD, ICDM, CIKM, WSDM, SDM, and PAKDD, as associate editor of leading journals such as TKDE, TWEB, DMKD and TKDD, and as area chair or senior PC member of numerous NLP, AI, Web, and data mining conferences. He is a Fellow of ACM, AAAI and IEEE.



Title: Multimodal Language and Behavior during Human Learning

Speaker: Sharon Oviatt – Monash University

Abstract: This talk will introduce how multimodal language and activity patterns help us think, and also what they tell us about human learning and expertise. I’ll walk through examples of student behavior during phases of group problem-solving, including when they collaborate versus work solo, when they shift their use of different modalities like speech and writing, and how they actively coordinate joint problem-solving using communication mechanisms like spoken interruptions. From a learning analytics viewpoint, I’ll also discuss what a student’s spoken interruptions and multimodal energy expenditure (both communicative and physical) reveal about their relative expertise in a domain like mathematics.

Professor Sharon Oviatt is internationally known for her work on human-centered interfaces, multimodal-multisensor interfaces, mobile interfaces, educational interfaces, the cognitive impact of computer input tools, and behavioral analytics. She originally received her PhD in Experimental Psychology at the University of Toronto, and she has been a professor of Computer Science, Information Technology, Psychology, and also Linguistics. Her research is known for its pioneering and multidisciplinary style at the intersection of Computer Science, Psychology, Linguistics, and Learning Sciences. Sharon has been recipient of the inaugural ACM-ICMI Sustained Accomplishment Award, National Science Foundation Special Creativity Award, ACM-SIGCHI CHI Academy Award, and an ACM Fellow Award. She has published a large volume of high-impact papers in a wide range of multidisciplinary venues (Google Scholar citations >11,000; h-index 47) and is an Associate Editor of the main journals and edited book collections in the field of human-centered interfaces. Her recent books include The Design of Future Educational Interfaces (2013, Routledge Press), The Paradigm Shift to Multimodality in Contemporary Computer Interfaces (2015, Morgan-Claypool), and the multi-volume Handbook of Multimodal-Multisensor Interfaces (co-edited with Bjoern Schuller, Phil Cohen, Anthony Krueger, Gerasimos Potamianos and Daniel Sonntag, 2017-2018, ACM Books). She currently is Director of Human-Computer Interaction and Creative Technologies in Information Technology at Monash University in Melbourne.


Title: Multimodal Affective Computing and Healthcare Applications

Speaker: Roland Goecke – University of Canberra

Abstract: In this talk, Roland Goecke will give an overview of his research into developing multimodal technology that analyses the affective state and more broadly behaviour of humans. Such technology is useful for a number of applications, with applications in healthcare, e.g. mental health disorders, being a particular focus for his research group. Depression and other mood disorders are common and disabling disorders. Their impact on individuals and families is profound. The WHO Global Burden of Disease reports quantify depression as the leading cause of disability worldwide. Despite the high prevalence, current clinical practice depends almost exclusively on self-report and clinical opinion, risking a range of subjective biases. There currently exist no laboratory-based measures of illness expression, course and recovery, and no objective markers of end-points for interventions in both clinical and research settings. Using a multimodal analysis of facial expressions and movements, body posture, head movements as well as vocal expressions, he is developing affective sensing technology that supports clinicians in the diagnosis and monitoring of treatment progress. Encouraging results from a recently completed pilot study demonstrate that this approach can achieve over 90% agreement with clinical assessment. After ten years of research, he will also talk about the lessons learnt in this project, such as measuring spontaneous expressions of affect, subtle expressions, and affect intensity using multimodal approaches. He is currently extending this line of research to other disorders such as anxiety, post-traumatic stress disorder, dementia and autism spectrum disorders. In particular for the latter, a natural progression is to analyse dyadic and group social interactions through multimodal behaviour analysis. At the core of our research is a focus on robust approaches that can work in real-world environments.

Roland Goecke is Professor of Affective Computing at the University of Canberra, Australia, where he leads the Human-Centred Technology Research Centre. He received his Masters degree in Computer Science from the University of Rostock, Germany, in 1998 and his PhD in Computer Science from the Australian National University, Canberra, Australia, in 2004. Before joining UC in December 2008, Prof Goecke worked as a Senior Research Scientist with start-up Seeing Machines, as a Researcher at the NICTA Canberra Research Labs, and as a Research Fellow at the Fraunhofer Institute for Computer Graphics, Germany. His research interests are in affective computing, pattern recognition, computer vision, human-computer interaction, multimodal signal processing and e-research. Prof Goecke has been an author and co-author of over 130 peer-reviewed publications. His research has been funded by grants from the Australian Research Council (ARC), the National Health and Medical Research Council (NHMRC), the National Science Foundation (NSF), the Australian National Data Service (ANDS) and the National eResearch Collaboration Tools and Resources project (NeCTAR).


Table of Contents

1. Overview and Scope
2. Datasets and CMU Multimodal Data SDK
3. Submission Information
4. Schedule and Important Dates
5. Suggested Articles


Overview and Scope


Computational analysis of human multimodal language is an emerging research area in Natural Language Processing (NLP). It expands the horizons of NLP to study language used in face-to-face communication and in online multimedia. This form of language contains the modalities of language (in terms of spoken text), vision (in terms of gestures and facial expressions) and acoustics (in terms of changes in the voice tone). At its core, this research area is focused on modeling the three modalities and their complex interactions. The first Grand Challenge and Workshop on Human Multimodal Language aims to facilitate the growth of this new research direction in the NLP community. The grand challenge is focused on multimodal sentiment analysis and emotion recognition on the recently introduced CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset. The grand challenge will be held in conjunction with the 56th Annual Meeting of the Association for Computational Linguistics 2018.

Communicating using multimodal language (verbal and nonverbal) makes up a significant portion of our communication, including face-to-face communication, video chatting, and social multimedia opinion sharing. Hence, its computational analysis is central to NLP research. The challenges of modeling human multimodal language can be split into two major categories: 1) studying each modality individually and modeling each in a manner that can be linked to other modalities (also known as intramodal dynamics), and 2) linking the modalities by modeling the interactions between them (also known as intermodal dynamics). Common forms of these interactions include complementary or correlated information across modalities. Beyond what is intrinsic to each modality, modeling human multimodal language is complex due to factors such as idiosyncrasy in communicative styles, non-trivial alignment between modalities, and unreliable or contradictory information across modalities. Computational analysis of multimodal language is therefore a challenging research area.
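To make the second category concrete: one approach from the suggested articles, the Tensor Fusion Network (Zadeh et al., 2017), models intermodal dynamics with an outer product of the per-modality embeddings, each extended with a constant 1 so that unimodal and bimodal terms are kept alongside the trimodal ones. The sketch below illustrates only that idea in plain Python. The function name `tensor_fusion` and the toy embeddings are assumptions made for illustration, and this is not the reference implementation.

```python
from itertools import product
from typing import List, Sequence


def tensor_fusion(language: Sequence[float],
                  visual: Sequence[float],
                  acoustic: Sequence[float]) -> List[float]:
    """Return the flattened 3-way outer product of [1, l], [1, v], [1, a]."""
    # The constant 1 keeps lower-order terms: 1*1*a is unimodal,
    # 1*v*a is bimodal, and l*v*a is trimodal.
    l = [1.0, *language]
    v = [1.0, *visual]
    a = [1.0, *acoustic]
    return [x * y * z for x, y, z in product(l, v, a)]


# Toy 1-dimensional embeddings (hypothetical values).
fused = tensor_fusion([2.0], [3.0], [5.0])
print(fused)  # → [1.0, 5.0, 3.0, 15.0, 2.0, 10.0, 6.0, 30.0]
```

The fused vector has (d_l + 1)(d_v + 1)(d_a + 1) entries, so its size grows quickly with the embedding dimensions.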

The focus of this workshop is on joint analysis of language (spoken text), vision (gestures and expressions) and acoustic (paralinguistic) modalities. We seek the following types of submissions:

  • Grand challenge papers: Papers summarizing the research effort with the CMU-MOSEI shared task on multimodal sentiment analysis and emotion recognition. Grand challenge papers are 8 pages, including references.
  • Full and short papers: These papers present substantial, original and unpublished research on human multimodal language. Full papers are up to 8 pages including references, and short papers are 4 pages + 1 page for references.

Topics of interest for full and short papers include:

  • Multimodal sentiment analysis
  • Multimodal emotion recognition
  • Multimodal affective computing
  • Multimodal speaker traits recognition
  • Dyadic multimodal interactions
  • Multimodal dialogue modeling
  • Cognitive modeling and multimodal interaction
  • Statistical analysis of human multimodal language



Grand Challenge Dataset

The grand challenge dataset in this workshop is the newly introduced CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset. CMU-MOSEI is the largest dataset for multimodal sentiment analysis and emotion recognition to date. The dataset contains more than 23,000 sentence utterance videos from more than 1000 online YouTube speakers. Extensive details of the dataset are available here.

We strongly recommend reading the material presented in the reference section of this page to learn about the current state of the art in multimodal sentiment analysis and emotion recognition on datasets that involve all three modalities.

Other Datasets

Submissions are accepted from other multimodal human language datasets. These datasets are not part of the grand challenge. The following candidate datasets are already included in the CMU Multimodal Data SDK:

  1. CMU Multimodal Opinion Sentiment Intensity (CMU-MOSI)
  2. Persuasion Opinion Multimodal (POM)
  3. ICT-MMMO
  4. YouTube
  5. MOUD (Spanish multimodal sentiment analysis)
  6. IEMOCAP (to download please contact IEMOCAP creators)


This workshop is organized so that anyone in the NLP community, as well as in related machine learning communities, can easily participate without being overwhelmed. As organizers of this workshop, we are aware that dealing with information from multiple modalities can be intimidating and ultimately frustrating for first-time participants in this research area. Therefore, we built the CMU Multimodal Data SDK, which makes data loading as easy as a few lines of Python code. The CMU Multimodal Data SDK is agnostic to the dataset and loads all the preprocessed features from all modalities in a simple recurrent form that is used by TensorFlow and PyTorch. Users can access the CMU Multimodal Data SDK on January 18th by checking this website again (the link will appear here).

The following features are available through the CMU Multimodal Data SDK:

Language: Word vectors from GloVe as well as one-hot word representations. The duration of each word utterance is also provided by P2FA forced alignment.

Vision: Facial action units, which are indicators of facial muscle movements, as well as 6 basic emotions regressed from just facial features. The positions of facial landmarks are also present in the vision features. These features are sampled at 30 Hz.

Acoustic: The software COVAREP is used to extract acoustic features from the audio modality, including 12 Mel-frequency cepstral coefficients, pitch tracking and voiced/unvoiced segmenting features, glottal source parameters, peak slope parameters and maxima dispersion quotients. These features are sampled at 100 Hz.


While the modalities have different sampling rates, the CMU Multimodal Data SDK can align all of them to a desired modality so that the inputs have matching shapes. Please refer to the documentation for more information.
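As an illustration of what aligning to a desired modality can mean, the sketch below averages the frames of a fixed-rate modality over each word's time interval, so that every modality ends up with one vector per word. It is a minimal sketch in plain Python and does not use the CMU Multimodal Data SDK API. The function name `align_to_words`, the averaging strategy, and the toy data are assumptions made for illustration; the SDK documentation describes the actual behavior.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (start, end) of a word, in seconds


def align_to_words(frames: Sequence[Sequence[float]],
                   rate_hz: float,
                   word_intervals: Sequence[Interval]) -> List[List[float]]:
    """Average fixed-rate feature frames over each word interval.

    Frame i is assumed to start at time i / rate_hz. A frame belongs to a
    word if its start time falls in [start, end). Words that contain no
    frame get a zero vector.
    """
    dim = len(frames[0]) if frames else 0
    aligned = []
    for start, end in word_intervals:
        selected = [f for i, f in enumerate(frames)
                    if start <= i / rate_hz < end]
        if selected:
            # Column-wise mean over the selected frames.
            mean = [sum(col) / len(selected) for col in zip(*selected)]
        else:
            mean = [0.0] * dim
        aligned.append(mean)
    return aligned


# Toy data: two words and hypothetical 1-dimensional features.
words = [(0.0, 0.1), (0.1, 0.2)]
acoustic = [[float(i)] for i in range(20)]  # 100 Hz: 20 frames span 0.2 s
visual = [[float(i)] for i in range(6)]     # 30 Hz: 6 frames span 0.2 s

acoustic_aligned = align_to_words(acoustic, 100.0, words)
visual_aligned = align_to_words(visual, 30.0, words)

# Both modalities now have one vector per word.
assert len(acoustic_aligned) == len(visual_aligned) == len(words)
```

After alignment, the acoustic and visual sequences have the same length as the word sequence even though they were sampled at 100 Hz and 30 Hz, which is what allows them to be fed to a model step by step.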


Submission Information

Grand Challenge Papers

All submissions must be in the ACL 2018 style format (LaTeX) (Word). All submissions should use at least two of the three modalities of language, vision and acoustics. Submissions that use only one modality will not be accepted as oral presentations. Submissions may have 6 to 8 pages of content, with an unlimited number of pages for references. We plan on accepting 12 papers for oral sessions and up to 15 papers for posters. We invite the participants to explore new methods of embedding each modality as well as fusing information across them. For a comprehensive read on the state of the art, we invite you to look at the suggested articles at the bottom of this page.

Submissions are encouraged to report regression and classification results on the CMU-MOSEI test set. The number of classes varies based on the labels in the dataset. The details of each measure are available in the CMU Multimodal Data SDK GitHub. Please note that each Grand Challenge team can submit test set predictions fewer than 5 times. The results on the test set must exactly match the format of the labels in the CMU Multimodal Data SDK. If your results have formatting issues, we will ask you to submit them again. The results must not be submitted later than April 16th. Test set predictions can only be submitted via email to Zhun Liu, with the title “ACL2018 Grand Challenge Predictions for $team_name”, where $team_name is replaced with your team name. All teams must choose one name.
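The `$team_name` placeholder in the email title uses the same syntax as Python's `string.Template`, so the title can be generated rather than typed by hand. The sketch below is only a convenience for participants. The helper name `submission_title` and the team name `ExampleTeam` are hypothetical.

```python
from string import Template

# Title pattern quoted from the submission instructions.
TITLE = Template("ACL2018 Grand Challenge Predictions for $team_name")


def submission_title(team_name: str) -> str:
    """Return the email title for a team's test set predictions."""
    name = team_name.strip()
    if not name:
        raise ValueError("team name must not be empty")
    return TITLE.substitute(team_name=name)


print(submission_title("ExampleTeam"))
# → ACL2018 Grand Challenge Predictions for ExampleTeam
```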

If participants choose not to report their results on the grand challenge test set, their paper may be rejected, since priority will be given to papers that include test set results. Participants who choose not to submit their results on the test set will not be considered for rankings. They may still report their results on the validation sets provided with the training sets. Papers that report no metric on either the test set or the validation set are automatically considered workshop papers.


Workshop Papers

All submissions must be in the ACL 2018 style format (LaTeX) (Word). Full papers may be up to 8 pages, while short papers may be up to 4 pages. All submissions should use at least two of the three modalities of language, vision and acoustics on a publicly available dataset (examples are given under Other Datasets in the Datasets section). Workshop papers should contain original ideas or novel contributions. They need not present completed research with full results. We also accept papers that present partial results as long as the idea is original. Papers that study the datasets from a statistical perspective are also accepted. These papers may decide not to report classification or regression accuracies but rather bring statistical intuitions about the nature of human multimodal language. Full workshop papers will be given the same priority as grand challenge papers for oral presentation.



9:00-9:10: Opening Remarks

9:10-10:00: Keynote Bing Liu

10:00-10:10: [Oral GC] Getting the subtext without the text: Scalable multimodal sentiment classification from visual and acoustic modalities.

10:10-10:20: [Oral GC] Recognizing Emotions in Video Using Multimodal DNN Feature Fusion.

10:20-10:30: [Oral GC] Multimodal Relational Tensor Network for Sentiment and Emotion Classification.

10:30-11:00: Coffee Break

11:00-11:50: Keynote Sharon Oviatt

11:50-12:00: Advances in Multimodal Datasets

12:00-12:10: [Oral GC] Convolutional Attention Networks for Multimodal Emotion Recognition from Speech and Text Data

12:10-12:20: [Oral WS] Sentiment Analysis using Imperfect Views from Spoken Language and Acoustic Modalities

12:20-12:30: [Oral WS] Polarity and Intensity: the Two Aspects of Sentiment Analysis

13:30-14:20: Keynote Roland Goecke

14:20-14:30: [Oral WS] ASR-based Features for Emotion Recognition: A Transfer Learning Approach

14:30-14:40: [Oral WS] Seq2Seq2Sentiment: Multimodal Sequence to Sequence Models for Sentiment Analysis

14:40-14:50: [Oral WS] DNN Multimodal Fusion Techniques for Predicting Video Sentiment

14:50-15:00: Grand Challenge Results

15:00: Workshop Ends


Important Dates


The important dates for the workshop are as follows:

  • CMU Multimodal Data SDK available: Jan 18th
  • Train and Validation sets available: Jan 18th
  • Test Set Input Features Available: March 9th (extended from March 1st)
  • Paper Submission Deadline*: April 27th (extended from April 20th)
  • Notification of Acceptance: May 14th
  • Camera Ready Deadline: May 28th
  • Workshop Date: July 20th
  • Workshop Location: ACL 2018, Melbourne, Australia

* Submit the test set predictions no later than April 24th. The accuracy on test set will be returned to you within a few hours. You can only submit once!


Organizing Committee

  1. Amir Zadeh – Language Technologies Institute, Carnegie Mellon University
  2. Louis-Philippe Morency – Language Technologies Institute, Carnegie Mellon University
  3. Paul Pu Liang – Machine Learning Department, Carnegie Mellon University
  4. Soujanya Poria – Temasek Laboratories, Nanyang Technological University
  5. Erik Cambria – Temasek Laboratories, Nanyang Technological University
  6. Zhun Liu – Language Technologies Institute, Carnegie Mellon University
  7. Stefan Scherer – Institute for Creative Technologies, University of Southern California

Special thanks to Zhun Liu, Varun Lakshminarasimhan and Ying Shen for maintaining the CMU Multimodal Data SDK, and to Edmund Tong and Jonathan Vanbriesen for their help in creating the CMU-MOSEI dataset.



Suggested Articles


  • Zadeh, A., Liang, P. P., Mazumder, N., Poria, S., Cambria, E., & Morency, L. P. (2018). “Memory Fusion Network for Multi-view Sequential Learning”. Association for the Advancement of Artificial Intelligence (AAAI).
  • Chen, M., Wang, S., Liang, P. P., Baltrušaitis, T., Zadeh, A., & Morency, L. P. (2017, November). “Multimodal sentiment analysis with word-level fusion and reinforcement learning”. In Proceedings of the 19th ACM International Conference on Multimodal Interaction (pp. 163-171). ACM.
  • Zadeh, A., Chen, M., Poria, S., Cambria, E., & Morency, L. P. (2017). “Tensor Fusion Network for Multimodal Sentiment Analysis”. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 1114-1125).
  • Poria, S., Cambria, E., Hazarika, D., Majumder, N., Zadeh, A., & Morency, L. P. (2017). “Context-dependent sentiment analysis in user-generated videos”. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Vol. 1, pp. 873-883).
  • Yu, H., Gui, L., Madaio, M., Ogan, A., Cassell, J., & Morency, L. P. (2017). “Temporally Selective Attention Model for Social and Affective State Recognition in Multimedia Content”. In ACM Multimedia (pp. 1743-1751).
  • Rajagopalan, S. S., Morency, L. P., Baltrusaitis, T., & Goecke, R. (2016, October). “Extending long short-term memory for multi-view structured learning”. In European Conference on Computer Vision (pp. 338-353). Springer International Publishing.
  • Wang, H., Meghawat, A., Morency, L. P., & Xing, E. P. (2017, July). “Select-additive learning: Improving generalization in multimodal sentiment analysis”. In Multimedia and Expo (ICME), 2017 IEEE International Conference on (pp. 949-954). IEEE.
  • Nojavanasghari, B., Gopinath, D., Koushik, J., Baltrušaitis, T., & Morency, L. P. (2016, October). “Deep multimodal fusion for persuasiveness prediction”. In Proceedings of the 18th ACM International Conference on Multimodal Interaction (pp. 284-288). ACM.
  • Baltrušaitis, T., Ahuja, C., & Morency, L. P. (2017). “Multimodal Machine Learning: A Survey and Taxonomy”. arXiv preprint arXiv:1705.09406.