LTI-11777: Multimodal Machine Learning



Multimodal machine learning (MMML) is a vibrant multi-disciplinary research field that addresses some of the original goals of artificial intelligence by integrating and modeling multiple communicative modalities, including linguistic, acoustic, and visual messages. From the initial research on audio-visual speech recognition to more recent language-and-vision projects such as image and video captioning, this research field poses unique challenges for multimodal researchers, given the heterogeneity of the data and the contingency often found between modalities. This course will teach fundamental mathematical concepts related to MMML, including multimodal alignment and fusion, heterogeneous representation learning, and multi-stream temporal modeling. We will also review recent papers describing state-of-the-art probabilistic models and computational algorithms for MMML and discuss current and upcoming challenges.

The main technical topics are: (1) multimodal representation learning, including multimodal auto-encoders and deep learning; (2) multimodal component analysis and fusion, including deep canonical correlation analysis and multi-kernel learning; (3) multimodal alignment and multi-stream modeling, including attention models and multimodal recurrent neural networks; and (4) multimodal graphical models, including continuous and fully-connected conditional random fields. The course will also discuss many recent applications of MMML, including multimodal affect recognition, image and video captioning, and cross-modal multimedia retrieval.


Course Topics


** Exact topics may change based on student interests and time restrictions. **

Week 1 Course introduction

· Research and technical challenges

· Multimodal applications and datasets

Week 2 Basic mathematical concepts

· Language, image and audio representations

· Loss functions and basic neural networks
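As an illustrative sketch of the loss functions covered here (not course material itself), the softmax cross-entropy loss can be written in a few lines of plain Python; the example inputs are arbitrary:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability the model assigns to the true class."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# A confident, correct prediction yields a small loss...
low = cross_entropy([5.0, 0.0, 0.0], 0)
# ...while a confident, wrong one yields a large loss.
high = cross_entropy([5.0, 0.0, 0.0], 1)
```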

Week 3 Convolutional neural networks and optimization

· Neural network optimization

· Convolutional neural networks
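The core operation of a convolutional layer can be sketched directly; the version below is a minimal illustration (valid mode, stride 1, and, as in most deep learning libraries, actually cross-correlation) with a hand-picked edge-detector kernel:

```python
def conv2d(image, kernel):
    """Valid-mode 2-D convolution (no padding, stride 1) over nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1
    ow = len(image[0]) - kw + 1
    out = []
    for i in range(oh):
        row = []
        for j in range(ow):
            s = 0.0
            for di in range(kh):
                for dj in range(kw):
                    s += image[i + di][j + dj] * kernel[di][dj]
            row.append(s)
        out.append(row)
    return out

# A vertical-edge detector applied to an image with a hard left/right edge.
image = [[1, 1, 0, 0]] * 4
kernel = [[1, -1]] * 2  # responds where left pixels exceed right pixels
edges = conv2d(image, kernel)
```

A trained CNN learns many such kernels; only the weights differ from this hand-set example.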

Week 4 Recurrent neural networks

· Backpropagation Through Time

· Gated networks and LSTM
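The gating equations of an LSTM can be sketched for a single unit; the weights below are arbitrary illustrative values, not a trained model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One step of a single-unit LSTM.
    W maps each gate name to (w_x, w_h, b): input, recurrent, bias weights."""
    def gate(name, squash):
        w_x, w_h, b = W[name]
        return squash(w_x * x + w_h * h_prev + b)
    i = gate("input", sigmoid)    # how much new information to write
    f = gate("forget", sigmoid)   # how much old cell state to keep
    o = gate("output", sigmoid)   # how much of the cell to expose
    g = gate("cell", math.tanh)   # candidate cell content
    c = f * c_prev + i * g        # additive update eases gradient flow over time
    h = o * math.tanh(c)
    return h, c

# Run a short sequence through the cell with arbitrary illustrative weights.
W = {name: (0.5, 0.5, 0.0) for name in ("input", "forget", "output", "cell")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 1.0]:
    h, c = lstm_step(x, h, c, W)
```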

Week 5 Multimodal representation learning

· Multimodal auto-encoders

· Multimodal joint representations
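The simplest joint representation — concatenate per-modality features and map them through a shared layer — can be sketched as follows; the feature values are made up and the layer weights are random rather than learned, purely for illustration:

```python
import math
import random

random.seed(0)

def joint_representation(text_feats, image_feats, weights, bias):
    """Fuse two modality feature vectors into one shared embedding:
    concatenate (early fusion), then apply a linear layer with tanh."""
    fused = text_feats + image_feats
    return [
        math.tanh(sum(w * x for w, x in zip(row, fused)) + b)
        for row, b in zip(weights, bias)
    ]

text = [0.2, 0.9]        # e.g. pooled word-embedding features (illustrative)
image = [0.5, 0.1, 0.7]  # e.g. pooled CNN features (illustrative)
dim_in, dim_out = len(text) + len(image), 4
W = [[random.uniform(-1, 1) for _ in range(dim_in)] for _ in range(dim_out)]
b = [0.0] * dim_out
z = joint_representation(text, image, W, b)
```

A multimodal auto-encoder adds a decoder and trains such a layer to reconstruct each modality from the shared code.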

Week 6 First project assignment – Presentations
Week 7 Multivariate statistics and coordinated representations

· Deep canonical correlation analysis

· Non-negative matrix factorization
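Non-negative matrix factorization can be sketched with the classic Lee–Seung multiplicative updates; this is an illustrative toy implementation on nested lists, not a production algorithm:

```python
import random

random.seed(1)

def nmf(V, k, iters=500, eps=1e-9):
    """Factorize a non-negative matrix V ~= W.H by multiplicative updates
    minimizing squared reconstruction error."""
    n, m = len(V), len(V[0])
    W = [[random.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[random.random() + 0.1 for _ in range(m)] for _ in range(k)]

    def matmul(A, B):
        return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
                 for j in range(len(B[0]))] for i in range(len(A))]

    def transpose(A):
        return [list(col) for col in zip(*A)]

    for _ in range(iters):
        WH, Wt = matmul(W, H), transpose(W)
        num, den = matmul(Wt, V), matmul(Wt, WH)      # H <- H * WtV / WtWH
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        WH, Ht = matmul(W, H), transpose(H)
        num, den = matmul(V, Ht), matmul(WH, Ht)      # W <- W * VHt / WHHt
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H

# A rank-1 non-negative matrix should be recovered almost exactly with k = 1.
V = [[1.0, 2.0], [2.0, 4.0]]
W, H = nmf(V, 1)
```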

Week 8 Multimodal alignment and attention models

· Explicit alignment and dynamic time warping

· Implicit alignment and attention models
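Explicit alignment with dynamic time warping fits in a few lines; the sketch below (an illustration, not course material) aligns two 1-D sequences sampled at different rates:

```python
def dtw(seq_a, seq_b):
    """Dynamic time warping distance between two 1-D sequences, the
    classic explicit-alignment method for multi-stream temporal signals."""
    INF = float("inf")
    n, m = len(seq_a), len(seq_b)
    # D[i][j] = cost of the best alignment of seq_a[:i] with seq_b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(seq_a[i - 1] - seq_b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # stretch seq_a
                                 D[i][j - 1],      # stretch seq_b
                                 D[i - 1][j - 1])  # advance both
    return D[n][m]

# The second sequence is a time-stretched copy of the first, so DTW
# aligns them at zero cost, while an element-wise comparison would not.
dist = dtw([0, 1, 2, 3], [0, 1, 1, 2, 2, 3])
```

Attention models make this alignment implicit and differentiable, replacing the hard warping path with learned soft weights.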

Week 9 Multimodal optimization

· Practical deep model optimization

· Variational approaches

Week 10 Probabilistic graphical models

· Boltzmann distribution and CRFs

· Continuous and fully-connected CRFs
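The heart of linear-chain CRF inference is the forward algorithm for the log-partition function; the sketch below uses tiny made-up potentials purely to illustrate the recursion:

```python
import math

def log_partition(emissions, transitions):
    """Forward algorithm for a linear-chain CRF: log of the sum of
    exp(score) over every possible label sequence."""
    K = len(emissions[0])
    alpha = list(emissions[0])  # log-potential of paths ending in each label
    for t in range(1, len(emissions)):
        alpha = [
            emissions[t][j] + math.log(sum(
                math.exp(alpha[i] + transitions[i][j]) for i in range(K)))
            for j in range(K)
        ]
    return math.log(sum(math.exp(a) for a in alpha))

def sequence_score(labels, emissions, transitions):
    """Unnormalized log-score of one label sequence."""
    s = emissions[0][labels[0]]
    for t in range(1, len(labels)):
        s += transitions[labels[t - 1]][labels[t]] + emissions[t][labels[t]]
    return s

# Probability of a label sequence = exp(score - log_partition).
emissions = [[0.5, 0.1], [0.2, 0.8]]      # per-step label potentials
transitions = [[0.3, -0.3], [-0.3, 0.3]]  # label-to-label potentials
logZ = log_partition(emissions, transitions)
p = math.exp(sequence_score([0, 1], emissions, transitions) - logZ)
```

Continuous and fully-connected CRFs generalize the label space and the graph structure, but the same partition-function idea underlies them.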

Week 11 Mid-term project assignment – Presentations
Week 12 Multimodal fusion and new directions

· Multi-kernel learning and fusion

· New directions in multimodal machine learning
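The fusion step in multi-kernel learning is a convex combination of per-modality kernels; in the illustrative sketch below the combination weights are fixed by hand, whereas MKL would learn them:

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel on feature vectors from one modality."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)

def combined_kernel(x_modalities, y_modalities, betas):
    """Multi-kernel fusion: a weighted sum of one kernel per modality."""
    return sum(b * rbf_kernel(x, y)
               for b, x, y in zip(betas, x_modalities, y_modalities))

# Two samples, each described by (audio, visual) feature vectors (made up).
x = ([0.1, 0.2], [1.0, 0.0, 0.5])
y = ([0.1, 0.3], [0.9, 0.1, 0.5])
k = combined_kernel(x, y, betas=[0.5, 0.5])
```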

Week 13 Thanksgiving week (+ Project preparation)
Week 14 Advanced multimodal representations

· Image and video description

· Guest lecture

Week 15 Final project assignment – Presentations


Examples of Reading Materials


  • Bengio, Yoshua, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35.8 (2013): 1798-1828.

  • Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. European Conference on Computer Vision. Springer International Publishing, 2014.

  • Andrea Frome, Greg S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. DeViSE: A Deep Visual-Semantic Embedding Model. Advances in Neural Information Processing Systems (NIPS), 2013.

  • Kulkarni, Girish, et al., Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence 35.12 (2013): 2891-2903.

  • Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko and Trevor Darrell, Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.

  • Ammar, Waleed, et al., Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925 (2016).