Predicting Co-verbal Gestures: A Deep and Temporal Modeling Approach

C.-C. Chiu, L.-P. Morency and S. Marsella. Predicting Co-verbal Gestures: A Deep and Temporal Modeling Approach. In Proceedings of the International Conference on Intelligent Virtual Agents (IVA), 2015.


Typical sequence-modeling techniques rely on well-segmented sequences that have been edited to remove noisy or irrelevant parts. Such methods are therefore hard to apply to the noisy sequences expected in real-world applications.

In one of our projects, we study sequence modeling by combining RNNs, which capture temporal dependencies, with an attention mechanism, which localizes the salient observations relevant to the final decision and ignores the irrelevant (noisy) parts of the input sequence.
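The combination described above can be illustrated with a minimal sketch: a small recurrent network produces one hidden state per time step, and a softmax attention layer weights those states before they are pooled into a single vector for the final decision. Everything here is an assumption made for illustration (the scalar inputs, the random untrained weights, and the hypothetical names `rnn_states`, `attention_pool`, and `query`); it is not the architecture used in the project.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_states(inputs, w_in, w_rec):
    """Elman-style RNN over scalar inputs: h_t = tanh(w_in * x_t + W_rec h_{t-1})."""
    size = len(w_in)
    hidden = [0.0] * size
    states = []
    for x in inputs:
        hidden = [
            math.tanh(w_in[i] * x + sum(w_rec[i][j] * hidden[j] for j in range(size)))
            for i in range(size)
        ]
        states.append(hidden)
    return states


def attention_pool(states, query):
    """Score each hidden state against a query vector, then take the weighted sum."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [sum(w * state[i] for w, state in zip(weights, states)) for i in range(dim)]
    return context, weights


# Illustrative, untrained parameters (assumption: in practice these are learned).
random.seed(0)
hidden_size = 4
w_in = [random.uniform(-1, 1) for _ in range(hidden_size)]
w_rec = [[random.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
query = [random.uniform(-1, 1) for _ in range(hidden_size)]

sequence = [0.0, 0.1, 2.0, -0.1, 0.0]  # a mostly quiet sequence with one large observation
states = rnn_states(sequence, w_in, w_rec)
context, weights = attention_pool(states, query)
```

After training, the attention weights would be expected to concentrate on the time steps that matter for the decision; with the random parameters above they only show the mechanics of the weighting.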

More recent work uses more powerful neural network models, such as Transformers, to process longer sequences. One of our more recent projects uses a hierarchical architecture to model sequences at multiple temporal resolutions, which represents the data at different levels of granularity and makes long-range dependencies easier to capture. This work is being applied to music generation, where modeling hierarchical structure is especially important.
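As a rough illustration of what "multiple temporal resolutions" can mean, the sketch below builds a pyramid of progressively coarser views of one sequence by averaging adjacent windows. This is only one simple way to obtain such views, assumed here for illustration; the function names `downsample` and `temporal_pyramid` are hypothetical, and the project's hierarchical architecture is not described in enough detail here to reproduce.

```python
def downsample(sequence, factor):
    """Average non-overlapping windows of `factor` steps (the last window may be shorter)."""
    windows = [sequence[i:i + factor] for i in range(0, len(sequence), factor)]
    return [sum(window) / len(window) for window in windows]


def temporal_pyramid(sequence, factor=2, levels=3):
    """Return [finest, ..., coarsest] views of the same sequence."""
    pyramid = [list(sequence)]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1], factor))
    return pyramid


pyramid = temporal_pyramid([1, 2, 3, 4, 5, 6, 7, 8], factor=2, levels=3)
# pyramid[0] keeps all 8 steps, pyramid[1] has 4 steps, pyramid[2] has 2 steps
```

Each step at a coarse level summarizes a longer stretch of the original sequence, so a model that reads the coarse levels needs fewer steps to relate distant events.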


The success of video-sharing and social network websites has led to a surge in online multimedia content, and a large proportion of the posted videos are human-centric. The sheer amount of such data encourages research on behavior understanding that can discover the affective and social states expressed in human-centric multimedia content. We can model personality and social interaction via temporal modeling and multimodal fusion.

Rapport: Rapport is a harmonious relationship in which people are coordinated and understand each other. The power of rapport in social interactions has inspired us to develop an intelligent virtual agent that induces the subjective feeling, and many of the behavioral benefits, of the psychological concept of rapport. We have also developed a system that automatically detects rapport in remote peer tutoring.

Persuasiveness: Persuasiveness is a high-level personality trait that quantifies the influence a speaker has on the beliefs, attitudes, intentions, motivations, and behavior of the audience. As social multimedia becomes an important channel for propagating ideas and opinions, analyzing persuasiveness grows in importance. Inspired by the success of deep learning techniques, we study persuasiveness prediction with deep multimodal fusion, which combines signals from the visual, acoustic, and text modalities.
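To make the idea of fusing modalities concrete, here is a minimal sketch of the simplest form of fusion: per-modality feature vectors are concatenated and passed through a single linear layer with a sigmoid to produce a score between 0 and 1. This is an assumption chosen for brevity, not the deep fusion model studied in the project; the function name `fuse_and_score`, the feature values, and the weights are all hypothetical.

```python
import math


def fuse_and_score(visual, acoustic, text, weights, bias=0.0):
    """Concatenate the three feature vectors, then apply a linear layer and a sigmoid."""
    fused = list(visual) + list(acoustic) + list(text)
    if len(fused) != len(weights):
        raise ValueError("expected one weight per fused feature")
    logit = bias + sum(w * x for w, x in zip(weights, fused))
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical features: 2 visual, 1 acoustic, 2 text (in practice produced by encoders).
score = fuse_and_score(
    visual=[0.2, -0.1],
    acoustic=[0.7],
    text=[0.05, 0.3],
    weights=[0.5, -0.4, 1.2, 0.0, 0.8],  # illustrative values, normally learned
)
```

A deep fusion model would replace the single linear layer with learned nonlinear layers, possibly with modality-specific encoders before the features are combined.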