One of the long-standing goals of machine learning is to enable models to learn without human supervision. The fundamental way of achieving this is unsupervised learning. We believe that Self-Supervised Learning (SSL), a form of unsupervised learning, is one of the most promising ways to learn representations of, and make inferences about, the world without human supervision. Self-supervised learning lets ML models learn from orders of magnitude more data, which is important for recognizing and understanding subtler, less common patterns in the world.
In our lab, we study how SSL can benefit from multimodal input. Our models take advantage of vision, natural language, and speech data. We aim to study the performance, interpretability, and robustness of self-supervised models on multimodal inputs and tasks, with particular attention to the information unique to each modality and the interactions between different modalities.
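To make this concrete, one widely used self-supervised objective for paired multimodal data is a CLIP-style symmetric contrastive loss: matching pairs (e.g. an image and its caption) are pulled together in embedding space while all other pairings in the batch are pushed apart. The sketch below is purely illustrative, not a description of our lab's specific model; the function name, the NumPy implementation, and the temperature value are all assumptions made for the example.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    emb_a and emb_b are (batch, dim) arrays from two modalities
    (e.g. image and text encoders). Row i of emb_a is the positive
    match for row i of emb_b; every other row is a negative.
    """
    # L2-normalize so dot products become cosine similarities.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature            # (batch, batch) similarity matrix
    labels = np.arange(len(logits))           # i-th item in a matches i-th in b

    def xent(l):
        # Row-wise cross-entropy against the diagonal (the true pairs).
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the two directions: a->b retrieval and b->a retrieval.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
img_emb = rng.normal(size=(8, 32))   # stand-in for image encoder outputs
txt_emb = rng.normal(size=(8, 32))   # stand-in for text encoder outputs
loss = contrastive_loss(img_emb, txt_emb)
```

No labels are required: the supervision signal comes entirely from the pairing of modalities, which is what makes such objectives self-supervised. The same symmetric construction extends to speech by adding an audio encoder and contrasting it against either of the other two modalities.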