The beauty of the series of work is to combine statistical methods with multimodal machine learning problems. The inherent statistical property gives the model more interpretability/explanations and guaranteed bounds. We employ probabilistic graphical models or statistical kernel methods for multimodal generation, multimodal time-series fusion, and modeling uncertainty in the multimodal environment.

In the example, we present a model that can 1) learn complex intra-modal and cross-modal interactions for prediction and 2) be robust to unexpected missing or noisy modalities during testing. The model factorizes representations into two sets of independent factors: multimodal discriminative and modality-specific generative factors for tackling a joint generative-discriminative objective across multimodal data and labels.

Y.-H.H. Tsai*, P.P. Liang*, A. Zadeh, L.-P. Morency, R. Salakhutdinov. Learning Multimodal Representations using Factorized Deep Generative Models. NeurIPS 2018 Workshop on Bayesian Deep Learning