Unsupervised Text Recap Extraction for TV Series

Yu, S. Zhang and L.-P. Morency, Unsupervised Text Recap Extraction for TV Series, In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016


Summarization is a widely explored problem in natural language processing: the goal is to create a summary that retains the most important points of the original document. In multimodal summarization, we learn both a hierarchical feature representation that captures high-level concepts and the interaction between text and video content.

Hierarchical Sequence Summarization: Building on the success of hierarchical feature representations in various computer vision tasks, we construct a hierarchy dynamically and recursively by alternating between sequence learning and sequence summarization, and we apply it to action recognition.
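The alternation between learning and summarization can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the method itself: `similarity` is a fixed distance-based stand-in for a learned per-layer sequence model, `summarize` merges adjacent similar items by averaging, and the `threshold` and `max_layers` values are arbitrary.

```python
from typing import List

Vector = List[float]


def similarity(a: Vector, b: Vector) -> float:
    """Hypothetical stand-in for a learned model: negative squared distance."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))


def summarize(seq: List[Vector], threshold: float) -> List[Vector]:
    """Group adjacent items that look similar and replace each group by its mean."""
    if not seq:
        return []
    groups: List[List[Vector]] = [[seq[0]]]
    for item in seq[1:]:
        if similarity(groups[-1][-1], item) >= threshold:
            groups[-1].append(item)   # extend the current group
        else:
            groups.append([item])     # start a new group
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups]


def build_hierarchy(seq: List[Vector],
                    threshold: float = -0.5,
                    max_layers: int = 5) -> List[List[Vector]]:
    """Stack layers of progressively shorter sequences until nothing merges."""
    layers = [seq]
    for _ in range(max_layers):
        shorter = summarize(layers[-1], threshold)
        if len(shorter) == len(layers[-1]):
            break                     # no further compression at this layer
        layers.append(shorter)
    return layers


frames = [[0.0], [0.1], [0.2], [5.0], [5.1]]
hierarchy = build_hierarchy(frames)
```

In this toy run, the five input frames collapse into two higher-level segments, and the recursion stops because the two segments are too dissimilar to merge again.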

Summarization for TV Script: Recap sequences at the beginning of TV shows help the audience absorb the essence of previous episodes and draw their attention to upcoming plots. We study TV recap summarization, which differs from traditional text summarization in that the summary is expected to serve a dual purpose: summarizing previous content and capturing the plot contingency between adjacent episodes.

One of the long-standing goals of artificial intelligence research is to enable humans to communicate with machines through natural language interfaces. A fundamental problem in achieving this goal is grounded language learning: the task of learning the meaning of natural language units (e.g., utterances, phrases, or words) by leveraging sensory data (e.g., an image). Grounded language learning is computationally challenging because natural language is inherently ambiguous and sensory data is imperfect.

We study grounded language learning in the context of learning an interpretable model for referring expressions, i.e., localizing a visual object described by a natural language expression. We propose GroundNet, a dynamic neural architecture for localizing objects mentioned in a referring expression for an image. Our model takes advantage of natural language compositionality to improve interpretability while maintaining high predictive accuracy. Critically, our approach relies on a syntactic analysis of the input referring expression to shape the computation graph. We find that this form of inductive bias helpfully constrains the learned model’s interpretation without being overly restrictive. An important intermediate step in grounding referring expressions is the localization of supporting object mentions. Our experiments on the GoogleRef dataset show that GroundNet successfully identifies intermediate supporting objects while maintaining performance comparable to state-of-the-art approaches.
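To make the idea of a parse-shaped computation graph concrete, here is a toy sketch. It is not GroundNet: the `Noun` and `Rel` node types, the hand-written parse, the exact-label matching, and the two spatial relations are all hypothetical simplifications, and nothing is learned. It only shows how the structure of the expression can decide which computations run, and how intermediate results for the supporting object remain inspectable.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Box:
    label: str   # hypothetical detector label
    x: float     # horizontal centre of the box


@dataclass
class Noun:
    word: str


@dataclass
class Rel:
    head: "Node"      # the object being referred to
    relation: str     # e.g. "left of"
    support: "Node"   # the supporting object mention


Node = Union[Noun, Rel]
Trace = List[Tuple[str, List[float]]]


def evaluate(node: Node, boxes: List[Box], trace: Trace) -> List[float]:
    """Recursively score boxes; the parse tree dictates the computation."""
    if isinstance(node, Noun):
        scores = [1.0 if b.label == node.word else 0.0 for b in boxes]
        trace.append((node.word, scores))
        return scores
    head = evaluate(node.head, boxes, trace)
    support = evaluate(node.support, boxes, trace)
    # Use the best-scoring supporting box as the spatial anchor.
    anchor = boxes[max(range(len(boxes)), key=support.__getitem__)]
    if node.relation == "left of":
        mask = [1.0 if b.x < anchor.x else 0.0 for b in boxes]
    elif node.relation == "right of":
        mask = [1.0 if b.x > anchor.x else 0.0 for b in boxes]
    else:
        raise ValueError(f"unsupported relation: {node.relation}")
    scores = [h * m for h, m in zip(head, mask)]
    trace.append((node.relation, scores))
    return scores


# Hand-written parse of "cup left of laptop" (a real system would use a parser).
boxes = [Box("cup", 1.0), Box("laptop", 2.0), Box("cup", 3.0)]
tree = Rel(Noun("cup"), "left of", Noun("laptop"))
trace: Trace = []
scores = evaluate(tree, boxes, trace)
best = max(range(len(boxes)), key=scores.__getitem__)
```

The `trace` list records the score vector produced at each node, so the localization of the supporting mention ("laptop") can be examined separately from the final answer.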