Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval

Guosheng Zhang; Linkai Liu; Keyao Wang; Haixiao Yue; Zhiwen Tan; Xiao Tan

arXiv:2604.25273·cs.CV·April 29, 2026

TL;DR

This paper introduces SSA-ME, a framework that improves large multimodal models by emphasizing salient subjects and aligning cross-modal attention, leading to better retrieval accuracy.

Contribution

The paper proposes a novel saliency-aware embedding method that strengthens subject-level semantics and balances the integration of visual and textual information in multimodal models.

Findings

01

Achieves state-of-the-art results on the MMEB benchmark.

02

Improves semantic alignment of salient regions in image-text retrieval.

03

Enhances interpretability of multimodal attention mechanisms.

Abstract

Despite significant progress in Unified Multimodal Retrieval (UMR) powered by Large Multimodal Models (LMMs), existing embedding methods primarily focus on sample-level objectives via contrastive learning while overlooking the crucial subject-level semantics. This limitation hinders the model's ability to group semantically coherent subjects in complex multimodal queries, manifesting as semantic alignment deviation, where models fail to accurately localize salient text-referred regions in visual content. Moreover, without explicit guidance to model salient visual subjects, LMMs tend to over-rely on textual cues, resulting in visual modality neglect and suboptimal utilization of visual knowledge. To this end, we propose Salient Subject-Aware Multimodal Embedding (SSA-ME), a novel framework designed to enhance fine-grained representation learning through saliency-aware modeling. SSA-ME…
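
For readers unfamiliar with the "sample-level objectives via contrastive learning" that the abstract contrasts against, the following is a minimal sketch of a generic symmetric in-batch InfoNCE loss. It is not SSA-ME's objective, which the truncated abstract does not specify. The function names (`info_nce`, `normalize`, `dot`) and the temperature value of 0.05 are illustrative assumptions, and embeddings are plain Python lists to keep the example dependency-free.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(queries, targets, temperature=0.05):
    """Symmetric in-batch InfoNCE (hypothetical helper, not from the paper).

    queries[i] and targets[i] form the positive pair; every other target
    in the batch acts as a negative for queries[i], and vice versa.
    The temperature of 0.05 is an assumed, commonly used value.
    """
    q = [normalize(v) for v in queries]
    t = [normalize(v) for v in targets]
    n = len(q)
    # Cosine-similarity logits: rows index queries, columns index targets.
    logits = [[dot(q[i], t[j]) / temperature for j in range(n)] for i in range(n)]

    def mean_cross_entropy(rows):
        # Cross-entropy with the diagonal entry as the correct class,
        # using the log-sum-exp trick for numerical stability.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(c) for c in zip(*logits)]  # target-to-query direction
    return 0.5 * (mean_cross_entropy(logits) + mean_cross_entropy(columns))


if __name__ == "__main__":
    aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < swapped)
```

The loss scores each query-target pair as a whole, so nothing in it indicates which image regions or subjects drove the match. That is the sample-level limitation the abstract describes.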
