Joint Fusion and Encoding: Advancing Multimodal Retrieval from the Ground Up

Lang Huang; Qiyu Wu; Zhongtao Miao; Toshihiko Yamasaki

arXiv:2502.20008·cs.CV·February 28, 2025

TL;DR

This paper introduces a unified, one-tower multimodal retrieval framework that fuses visual and textual inputs from the ground up. Early fusion enables richer cross-modal interactions and improves retrieval for complex multimodal queries.

Contribution

The paper proposes a ground-up fusion approach trained in two stages (post-training adaptation followed by instruction tuning), moving beyond late-fusion two-tower architectures for multimodal retrieval.

Findings

01 Outperforms traditional methods in diverse retrieval scenarios.

02 Early fusion yields greater improvements on modality fusion tasks.

03 Two-stage training (post-training adaptation followed by instruction tuning) adapts MLLMs into retrievers.

Abstract

Information retrieval is indispensable for today's Internet applications, yet traditional semantic matching techniques often fall short in capturing the fine-grained cross-modal interactions required for complex queries. Although late-fusion two-tower architectures attempt to bridge this gap by independently encoding visual and textual data before merging them at a high level, they frequently overlook the subtle interplay essential for comprehensive understanding. In this work, we rigorously assess these limitations and introduce a unified retrieval framework that fuses visual and textual cues from the ground up, enabling early cross-modal interactions for enhancing context interpretation. Through a two-stage training process--comprising post-training adaptation followed by instruction tuning--we adapt MLLMs as retrievers using a simple one-tower architecture. Our approach outperforms…
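The contrast the abstract draws between late-fusion two-tower encoding and ground-up one-tower fusion can be illustrated with a toy sketch. This is not the paper's implementation: a single parameter-free self-attention pass stands in for an MLLM, mean pooling and summation stand in for whatever pooling and merging a real system uses, and all function names (`late_fusion_embed`, `early_fusion_embed`, and the helpers) are hypothetical.

```python
import math

def mean_pool(tokens):
    """Average a list of token vectors into one vector."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def self_attention(tokens):
    """Toy stand-in for an encoder: each token attends to every token."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[i] for w, v in zip(weights, tokens))
                    for i in range(dim)])
    return out

def late_fusion_embed(image_tokens, text_tokens):
    """Two towers: each modality is encoded alone and merged only at the end."""
    img = mean_pool(self_attention(image_tokens))
    txt = mean_pool(self_attention(text_tokens))
    return l2_normalize([a + b for a, b in zip(img, txt)])

def early_fusion_embed(image_tokens, text_tokens):
    """One tower: both modalities share one sequence, so attention mixes them."""
    fused = self_attention(image_tokens + text_tokens)
    return l2_normalize(mean_pool(fused))

def cosine(u, v):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy inputs: two image-patch vectors and one text-token vector.
image_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_tokens = [[0.0, 0.0, 1.0]]
query = early_fusion_embed(image_tokens, text_tokens)
candidate = early_fusion_embed([[0.0, 1.0, 0.0]], [[0.0, 0.0, 1.0]])
score = cosine(query, candidate)
```

In the late-fusion function, image tokens never attend to text tokens; in the early-fusion function, every token attends across both modalities before pooling, which is the kind of early cross-modal interaction the abstract refers to.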

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling

Methods: Attentive Walk-Aggregating Graph Neural Network