Summarization is widely explored in natural language that aims to create a summary that retains the most important points of the original document. In multimodal summarization, we learn both the hierarchical feature representation to capture the high-level concepts and the interaction between text and video contents.

Hierarchical Sequence Summarization: Utilizing the success of hierarchical feature representation in various computer vision tasks, we build up a hierarchy dynamically and recursively by alternating sequence learning and sequence summarization for studying action recognition.

Summarization for TV Script: Sequences found at the beginning of TV shows help the audience absorb the essence of previous episodes, and grab their attention with upcoming plots. We study the TV recap summarization that distinguishes from the traditional text summarization as we expect the summary to capture the duality of summarization and plot contingency between adjacent episodes.


H. Wang, A. Meghawat, L.-P. Morency and E. Xing. Select-Additive Learning: Improving Generalization in Multimodal Sentiment Analysis. In Proceedings of the IEEE International Conference on Multimedia & Expo (ICME), 2017

Yu, S. Zhang and L.-P. Morency, Unsupervised Text Recap Extraction for TV Series, In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016