Attention Grounded Enhancement for Visual Document Retrieval

Wanqing Cui; Wei Huang; Yazhi Guo; Yibo Hu; Meiguang Jin; Junfeng Ma; Keping Bi

arXiv:2511.13415 · cs.IR · May 12, 2026

TL;DR

The paper introduces AGREE, a framework that uses cross-modal attention from multimodal large language models to improve fine-grained relevance modeling in visual document retrieval, leading to better interpretability and performance.

Contribution

AGREE leverages attention maps from MLLMs as proxy supervision to guide retrievers in identifying relevant document regions, enhancing fine-grained relevance understanding.
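One way to picture attention used as proxy supervision is a loss that pulls the retriever's per-region relevance distribution toward the MLLM's attention distribution over the same regions. The sketch below is an illustration only: this summary does not state the loss AGREE actually uses, so the KL-divergence objective, the function names, and the flat list-of-patches layout are all assumptions.

```python
import math


def softmax(scores):
    """Turn raw per-patch retriever scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def normalize(weights):
    """Rescale non-negative attention weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        # No usable attention signal: fall back to a uniform distribution.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def attention_alignment_loss(patch_scores, mllm_attention, eps=1e-12):
    """KL(target || predicted) over document patches.

    patch_scores   -- hypothetical retriever relevance score per patch
    mllm_attention -- hypothetical MLLM cross-modal attention weight per patch
    Assumption: KL divergence stands in for whatever objective the paper uses.
    """
    predicted = softmax(patch_scores)
    target = normalize(mllm_attention)
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )


attention = [0.7, 0.2, 0.1]
aligned = attention_alignment_loss([2.0, 0.5, 0.0], attention)
misaligned = attention_alignment_loss([0.0, 0.5, 2.0], attention)
print(aligned < misaligned)
```

Under this assumed objective, the loss is near zero when the retriever concentrates its score on the patches the MLLM attends to, and it grows as the two distributions diverge.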

Findings

01

AGREE outperforms the baseline by 12.82% in nDCG@1.

02

AGREE improves nDCG@5 by 5.03%.

03

Qualitative analysis shows deeper query-region alignment.
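For readers unfamiliar with the metric behind the first two findings, here is a minimal sketch of nDCG@k. It uses the common linear-gain form with a log2 rank discount; which variant the paper's benchmark uses is not stated here, so treat that choice, and the function names, as assumptions. The ideal ranking is computed from the same list of labels, which assumes every relevant document appears somewhere in that list.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of retrieved pages, in ranked order (1 = relevant).
ranked_labels = [0, 1, 0]
print(ndcg_at_k(ranked_labels, 1))
print(round(ndcg_at_k(ranked_labels, 5), 4))
```

nDCG@1 only rewards a relevant result in the top position, which is why it is the more sensitive of the two reported metrics to fine-grained ranking changes.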

Abstract

Visual document retrieval requires understanding heterogeneous and multi-modal content to satisfy implicit information needs. Recent advances use screenshot-based document encoding with fine-grained late interaction to encode holistic information and capture nuanced alignments, significantly improving retrieval performance. However, retrievers are still trained with coarse global relevance labels, without revealing which regions support the match. As a result, retrievers tend to rely on surface-level cues and struggle to capture implicit semantic connections, hindering their ability to handle non-extractive queries. To improve fine-grained relevance modeling, we propose an Attention-Grounded REtriever Enhancement (AGREE) framework. AGREE leverages cross-modal attention from multimodal large language models (MLLMs) as proxy supervision to guide the retriever in identifying relevant…
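The "fine-grained late interaction" mentioned in the abstract is commonly realized as ColBERT-style MaxSim scoring: each query token embedding is matched to its best document patch embedding, and the per-token maxima are summed. The sketch below shows that general technique, assuming dot-product similarity and plain lists of vectors; it is not taken from the AGREE code, and the function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, patch_vecs):
    """Late-interaction relevance: sum over query tokens of the best patch match."""
    return sum(
        max(dot(q, p) for p in patch_vecs)
        for q in query_vecs
    )


# Two query token embeddings and two document patch embeddings (toy values).
query = [[1.0, 0.0], [0.0, 1.0]]
patches = [[1.0, 0.0], [0.5, 0.5]]
print(maxsim_score(query, patches))  # 1.0 (first token) + 0.5 (second token) = 1.5
```

Because the score only records the winning patch per token, a global relevance label alone gives no signal about whether the winning patches are the ones that actually support the match, which is the gap the abstract describes.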

Code & Models

Repositories

VickiCui/AGREE (GitHub)
