X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs

Sirnam Swetha; Jinyu Yang; Tal Neiman; Mamshad Nayeem Rizve; Son Tran; Benjamin Yao; Trishul Chilimbi; Mubarak Shah

arXiv:2407.13851·cs.CV·July 22, 2024·1 cites

Open Access

TL;DR

X-Former is a novel transformer module that unifies contrastive and reconstruction learning to enhance visual representations in multimodal large language models, leading to improved detailed visual understanding.

Contribution

It introduces X-Former, a lightweight transformer that combines contrastive and masked image modeling for better visual features in MLLMs, with an innovative interaction mechanism.

Findings

01. Outperforms existing models on GQA visual reasoning tasks.

02. Achieves superior results on fine-grained visual perception benchmarks.

03. Enhances detailed visual understanding in multimodal models.

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have revolutionized the field of vision-language understanding by integrating visual perception capabilities into Large Language Models (LLMs). The prevailing trend in this field involves the utilization of a vision encoder derived from vision-language contrastive learning (CL), showing expertise in capturing overall representations while facing difficulties in capturing detailed local patterns. In this work, we focus on enhancing the visual representations for MLLMs by combining high-frequency and detailed visual representations, obtained through masked image modeling (MIM), with semantically-enriched low-frequency representations captured by CL. To achieve this goal, we introduce X-Former which is a lightweight transformer module designed to exploit the complementary strengths of CL and MIM through an innovative…
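To make the idea of pairing a contrastive objective with a masked-reconstruction objective concrete, here is a minimal NumPy sketch of a joint loss. It is an illustration under stated assumptions, not the X-Former module: the abstract above is truncated before it describes the interaction mechanism, so nothing here reflects the authors' architecture. The function names (`info_nce`, `masked_reconstruction`, `joint_loss`), the `temperature` default, and the `recon_weight` parameter are all hypothetical.

```python
import numpy as np


def info_nce(img, txt, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B); the diagonal holds matched pairs

    def diag_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # subtract row max for stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))


def masked_reconstruction(pred, target, mask):
    """Mean squared error averaged over masked patches only (mask == True)."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)  # (B, N)
    return (per_patch * mask).sum() / max(mask.sum(), 1)


def joint_loss(img, txt, pred, target, mask, recon_weight=1.0):
    """Hypothetical combined objective: contrastive term plus weighted MIM term."""
    return info_nce(img, txt) + recon_weight * masked_reconstruction(pred, target, mask)
```

The contrastive term rewards global image-text alignment, while the reconstruction term only scores patches that were hidden from the encoder, which is what pushes a model toward local detail. How the two signals are actually fused in X-Former is left to the full paper.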

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Focus · Mutual Information Machine/Mask Image Modeling · Contrastive Learning