TL;DR
The paper introduces AGREE, a framework that uses cross-modal attention from multimodal large language models to improve fine-grained relevance modeling in visual document retrieval, leading to better interpretability and performance.
Contribution
AGREE leverages attention maps from MLLMs as proxy supervision to guide retrievers in identifying relevant document regions, enhancing fine-grained relevance understanding.
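One way to picture attention used as proxy supervision is as a distribution-matching loss over document patches. The sketch below is illustrative only and is not taken from the paper: the function names, the use of a softmax over per-patch retriever scores, and the choice of KL divergence are all assumptions made here for the sake of a concrete example.

```python
import math


def softmax(scores):
    """Turn raw per-patch retriever scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(weights):
    """Scale non-negative attention weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    return [w / total for w in weights]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted), skipping patches with zero target mass."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )


def region_alignment_loss(mllm_attention, retriever_patch_scores):
    """Hypothetical auxiliary loss (an assumption, not the paper's stated loss).

    mllm_attention: non-negative attention mass the MLLM places on each patch.
    retriever_patch_scores: the retriever's raw relevance score for each patch.
    The loss is small when the retriever scores the same patches highly
    that the MLLM attends to.
    """
    target = normalize(mllm_attention)
    predicted = softmax(retriever_patch_scores)
    return kl_divergence(target, predicted)


# Three patches; the MLLM attends mostly to the third one.
attention = [1.0, 1.0, 2.0]
aligned = region_alignment_loss(attention, [0.0, 0.0, math.log(2.0)])
misaligned = region_alignment_loss(attention, [math.log(2.0), 0.0, 0.0])
```

In this toy setup the aligned retriever incurs a loss close to zero, while the misaligned one is penalized. In training, a term like this would presumably be added to the usual global relevance loss, though how AGREE combines the two is not stated in the text above.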
Findings
AGREE outperforms the baseline by 12.82% in nDCG@1.
AGREE improves nDCG@5 by 5.03%.
Qualitative analysis shows deeper query-region alignment.
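For readers unfamiliar with the metric behind these numbers, here is a minimal sketch of nDCG@k. It assumes the common linear-gain formulation with a log2 rank discount; the paper's exact variant and evaluation setup are not given in this summary, and the function names are chosen here for illustration.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results.

    relevances: graded relevance labels in ranked order (best-ranked first).
    Uses linear gain and a log2(rank + 1) discount with 1-based ranks.
    """
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based here, hence + 2
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (sorted) ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents at all
    return dcg_at_k(relevances, k) / ideal_dcg


# One relevant page, retrieved at rank 2.
ranking = [0, 1, 0]
score_at_1 = ndcg_at_k(ranking, 1)
score_at_5 = ndcg_at_k(ranking, 5)
```

With a single relevant page at rank 2, nDCG@1 is zero because only the top result counts, while nDCG@5 still gives partial credit. This is why nDCG@1 is the more sensitive of the two metrics to whether the best match is ranked first.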
Abstract
Visual document retrieval requires understanding heterogeneous and multi-modal content to satisfy implicit information needs. Recent advances use screenshot-based document encoding with fine-grained late interaction to encode holistic information and capture nuanced alignments, significantly improving retrieval performance. However, retrievers are still trained with coarse global relevance labels, without revealing which regions support the match. As a result, retrievers tend to rely on surface-level cues and struggle to capture implicit semantic connections, hindering their ability to handle non-extractive queries. To improve fine-grained relevance modeling, we propose an Attention-Grounded REtriever Enhancement (AGREE) framework. AGREE leverages cross-modal attention from multimodal large language models (MLLMs) as proxy supervision to guide the retriever in identifying relevant…
