Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval
Guosheng Zhang, Linkai Liu, Keyao Wang, Haixiao Yue, Zhiwen Tan, Xiao Tan

TL;DR
This paper introduces SSA-ME, a framework that improves large multimodal models by emphasizing salient subjects and aligning cross-modal attention, leading to better retrieval accuracy.
Contribution
It proposes a novel saliency-aware embedding method that enhances subject-level semantics and balances visual-textual integration in multimodal models.
Findings
Achieves state-of-the-art results on MMEB benchmark.
Improves semantic alignment of salient regions in image-text retrieval.
Enhances interpretability of multimodal attention mechanisms.
Abstract
Despite significant progress in Unified Multimodal Retrieval (UMR) powered by Large Multimodal Models (LMMs), existing embedding methods primarily focus on sample-level objectives via contrastive learning while overlooking the crucial subject-level semantics. This limitation hinders the model's ability to group semantically coherent subjects in complex multimodal queries, manifesting as semantic alignment deviation--where models fail to accurately localize salient text-referred regions in visual content. Moreover, without explicit guidance to model salient visual subjects, LMMs tend to over-rely on textual cues, resulting in visual modality neglect and suboptimal utilization of visual knowledge. To this end, we propose Salient Subject-Aware Multimodal Embedding (SSA-ME), a novel framework designed to enhance fine-grained representation learning through saliency-aware modeling. SSA-ME…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
