Joint Fusion and Encoding: Advancing Multimodal Retrieval from the Ground Up
Lang Huang, Qiyu Wu, Zhongtao Miao, Toshihiko Yamasaki

TL;DR
This paper introduces a unified multimodal retrieval framework that fuses visual and textual data early in the encoding process, significantly improving retrieval accuracy for complex multimodal queries by enabling richer cross-modal interactions.
Contribution
It proposes a ground-up fusion approach with a two-stage training process, advancing beyond late-fusion architectures for more effective multimodal retrieval.
Findings
The unified one-tower framework outperforms traditional methods in diverse retrieval scenarios.
Early fusion yields greater improvements on modality fusion tasks.
Two-stage training enhances model adaptation and instruction tuning.
Abstract
Information retrieval is indispensable for today's Internet applications, yet traditional semantic matching techniques often fall short in capturing the fine-grained cross-modal interactions required for complex queries. Although late-fusion two-tower architectures attempt to bridge this gap by independently encoding visual and textual data before merging them at a high level, they frequently overlook the subtle interplay essential for comprehensive understanding. In this work, we rigorously assess these limitations and introduce a unified retrieval framework that fuses visual and textual cues from the ground up, enabling early cross-modal interactions for enhancing context interpretation. Through a two-stage training process--comprising post-training adaptation followed by instruction tuning--we adapt MLLMs as retrievers using a simple one-tower architecture. Our approach outperforms…
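The contrast the abstract draws between late-fusion two-tower encoding and one-tower early fusion can be illustrated with a toy sketch. This is not the authors' implementation: a single randomly initialized self-attention layer stands in for the MLLM backbone, and the token shapes, mean pooling, additive merge, and all function names here are assumptions chosen only to show where the cross-modal interaction happens.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # embedding width (illustrative assumption)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over a (seq_len, D) token matrix."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def late_fusion_embed(image_tokens, text_tokens, params):
    """Two-tower: each modality is encoded alone, then merged at the end."""
    img_vec = self_attention(image_tokens, *params["image"]).mean(axis=0)
    txt_vec = self_attention(text_tokens, *params["text"]).mean(axis=0)
    return l2_normalize(img_vec + txt_vec)  # additive merge is an assumption

def early_fusion_embed(image_tokens, text_tokens, params):
    """One-tower: both modalities share one sequence, so attention mixes them."""
    joint = np.concatenate([image_tokens, text_tokens], axis=0)
    return l2_normalize(self_attention(joint, *params["joint"]).mean(axis=0))

def make_weights():
    # Random projections stand in for trained weights (hypothetical).
    return tuple(rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3))

params = {"image": make_weights(), "text": make_weights(), "joint": make_weights()}
image_tokens = rng.normal(size=(4, D))  # e.g. 4 image patch embeddings
text_tokens = rng.normal(size=(6, D))   # e.g. 6 text token embeddings

query_late = late_fusion_embed(image_tokens, text_tokens, params)
query_early = early_fusion_embed(image_tokens, text_tokens, params)

# Retrieval step: rank candidate embeddings by cosine similarity to the query.
candidates = l2_normalize(rng.normal(size=(5, D)))
ranking = np.argsort(-(candidates @ query_early))
```

In the two-tower function, image tokens never attend to text tokens, so any interaction between the modalities is limited to the final merge. In the one-tower function, every token attends to every other token across both modalities from the first layer onward, which is the kind of early cross-modal interaction the abstract describes.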
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling
Methods: Attentive Walk-Aggregating Graph Neural Network
