Text-centric Alignment for Multi-Modality Learning
Yun-Da Tsai, Ting-Yu Yen, Pei-Fu Guo, Zhe-Yan Li, Shou-De Lin

TL;DR
This paper introduces TAMML, a text-centric alignment method that uses foundation models and LLMs to make multimodal learning more robust to modality mismatch at inference time.
Contribution
It presents a new approach leveraging text as a universal semantic space to enhance multimodal systems' generalizability and robustness under modality mismatch conditions.
Findings
TAMML significantly improves the handling of unseen modality combinations.
The method maintains robust performance across diverse modality scenarios.
It demonstrates the potential of foundation models in flexible multimodal learning.
Abstract
This research paper addresses the challenge of modality mismatch in multimodal learning, where the modalities available during inference differ from those available at training. We propose the Text-centric Alignment for Multi-Modality Learning (TAMML) approach, an innovative method that utilizes Large Language Models (LLMs) with in-context learning and foundation models to enhance the generalizability of multimodal systems under these conditions. By leveraging the unique properties of text as a unified semantic space, TAMML demonstrates significant improvements in handling unseen, diverse, and unpredictable modality combinations. TAMML not only adapts to varying modalities but also maintains robust performance, showcasing the potential of foundation models in overcoming the limitations of traditional fixed-modality frameworks in embedding representations. This study contributes to the…
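The idea of text as a unified semantic space can be illustrated with a minimal sketch. This is not the authors' implementation: the converter functions, the `CONVERTERS` registry, and the `[modality]` tagging format are hypothetical names chosen for illustration, and the image converter is a stub standing in for a captioning foundation model. The sketch only shows why a text-only downstream model can accept a modality set at inference that differs from the one seen at training.

```python
from typing import Callable, Dict

# Hypothetical converters: each maps one raw modality to plain text.
# In a real system these would be foundation models (e.g. an image captioner).

def table_to_text(row: Dict[str, object]) -> str:
    """Serialize one tabular row into a sentence-like string."""
    return "; ".join(f"{key} is {value}" for key, value in row.items())


def image_to_text(description: str) -> str:
    """Stub for an image captioner; here the 'image' is already a description."""
    return f"An image showing {description}"


# Hypothetical registry of modality name -> text converter.
CONVERTERS: Dict[str, Callable[[object], str]] = {
    "tabular": table_to_text,
    "image": image_to_text,
    "text": str,
}


def to_unified_text(sample: Dict[str, object]) -> str:
    """Convert whatever modalities are present into one text input."""
    parts = []
    for name in sorted(sample):  # fixed order keeps the output deterministic
        if name not in CONVERTERS:
            raise KeyError(f"no text converter registered for modality {name!r}")
        parts.append(f"[{name}] {CONVERTERS[name](sample[name])}")
    return "\n".join(parts)


# Training-time sample uses tabular + text; inference-time sample uses image only.
train_sample = {"tabular": {"age": 3, "breed": "beagle"}, "text": "Friendly dog"}
test_sample = {"image": "a small dog on a sofa"}

train_input = to_unified_text(train_sample)
test_input = to_unified_text(test_sample)
```

Both `train_input` and `test_input` are plain strings, so the same text-only model could consume either one even though the two samples share no modality.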
Taxonomy
Topics: Natural Language Processing Techniques
