X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs
Sirnam Swetha, Jinyu Yang, Tal Neiman, Mamshad Nayeem Rizve, Son Tran, Benjamin Yao, Trishul Chilimbi, Mubarak Shah

TL;DR
X-Former is a novel transformer module that unifies contrastive and reconstruction learning to enhance visual representations in multimodal large language models, leading to improved detailed visual understanding.
Contribution
It introduces X-Former, a lightweight transformer that combines contrastive and masked image modeling for better visual features in MLLMs, with an innovative interaction mechanism.
Findings
Outperforms existing models on GQA visual reasoning tasks.
Achieves superior results on fine-grained visual perception benchmarks.
Enhances detailed visual understanding in multimodal models.
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have revolutionized the field of vision-language understanding by integrating visual perception capabilities into Large Language Models (LLMs). The prevailing trend in this field involves the utilization of a vision encoder derived from vision-language contrastive learning (CL), showing expertise in capturing overall representations while facing difficulties in capturing detailed local patterns. In this work, we focus on enhancing the visual representations for MLLMs by combining high-frequency and detailed visual representations, obtained through masked image modeling (MIM), with semantically-enriched low-frequency representations captured by CL. To achieve this goal, we introduce X-Former which is a lightweight transformer module designed to exploit the complementary strengths of CL and MIM through an innovative…
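To illustrate what combining the two objectives can look like, the sketch below adds a symmetric contrastive (InfoNCE) loss to a masked-patch reconstruction loss. It is a minimal sketch under stated assumptions, not the X-Former architecture or its interaction mechanism, which the truncated abstract does not specify. The function names, the temperature of 0.07, the mean-squared-error reconstruction target, and the weight `lam` are all hypothetical choices made for illustration.

```python
import math


def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for paired embeddings.

    Assumes both inputs are lists of unit-norm vectors, where
    image_emb[k] and text_emb[k] form the matching (positive) pair.
    """
    n = len(image_emb)
    # Dot products of unit-norm vectors are cosine similarities.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_emb]
        for img in image_emb
    ]

    def cross_entropy_on_diagonal(rows):
        # Softmax cross-entropy with the k-th entry of row k as the target.
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[k]
        return total / n

    columns = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(columns))


def masked_mse(pred_patches, target_patches, mask):
    """Mean squared error over masked patches only (mask[i] truthy)."""
    total, count = 0.0, 0
    for pred, target, is_masked in zip(pred_patches, target_patches, mask):
        if is_masked:
            total += sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
            count += 1
    return total / count if count else 0.0


def joint_loss(image_emb, text_emb, pred_patches, target_patches, mask, lam=1.0):
    """Contrastive loss plus lam times the reconstruction loss (lam is hypothetical)."""
    return (info_nce(image_emb, text_emb)
            + lam * masked_mse(pred_patches, target_patches, mask))


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[1.0, 0.0], [0.0, 1.0]]
    pred = [[1.0, 1.0], [0.0, 0.0]]
    target = [[0.0, 0.0], [0.0, 0.0]]
    print(joint_loss(images, texts, pred, target, mask=[1, 0], lam=0.5))
```

In this toy setup the contrastive term rewards image and text embeddings that agree at the whole-sample level, while the reconstruction term only penalizes errors on the masked patches, which is one common way to encourage attention to local detail.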
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Focus · Mutual Information Machine/Mask Image Modeling · Contrastive Learning
