ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment
Hao Yang, Yifan Ji, Zhipeng Xu, Zhenghao Liu, Yukun Yan, Zulong Chen, Shuo Wang, Yu Gu, Ge Yu

TL;DR
ReAlign enhances visual document retrieval with reasoning-guided, fine-grained alignment: a stronger vision-language model supplies supervision that focuses the retriever on crucial visual cues, improving retrieval accuracy across diverse datasets.
Contribution
The paper introduces ReAlign, a novel method that uses reasoning-guided supervision to improve the alignment of visual document representations with queries.
Findings
ReAlign achieves up to 2% relative improvement in retrieval performance.
The method generalizes across different VLM backbones.
ReAlign helps the retriever focus on critical visual cues when building document representations.
Abstract
Visual document retrieval aims to retrieve a set of document pages relevant to a query from visually rich collections. Existing methods often employ Vision-Language Models (VLMs) to encode queries and visual pages into a shared embedding space, which is then optimized via contrastive training. However, during visual document representation, localized evidence is usually scattered across complex document layouts, making it difficult for retrieval models to capture crucial cues for effective embedding learning. In this paper, we propose Reasoning-Guided Alignment (ReAlign), a method that enhances visual document retrieval by leveraging the reasoning capability of VLMs to provide fine-grained visual document descriptions as supervision signals for training. Specifically, ReAlign employs a superior VLM to identify query-related regions on a page and then generates a query-aware description…
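The contrastive training mentioned in the abstract can be illustrated with a minimal sketch. It assumes a standard in-batch InfoNCE objective, which is a common choice for dense retrievers but is not confirmed by the text here as the exact loss ReAlign uses. The function name `info_nce_loss`, the `temperature` value, and the toy embeddings are all hypothetical, and the ReAlign description-based supervision itself is not modeled.

```python
import math


def info_nce_loss(query_embs, page_embs, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss over matched query/page pairs.

    Assumption: query_embs[i] and page_embs[i] form the positive pair, and
    every other page in the batch acts as a negative. Vectors must be nonzero.
    """

    def normalize(vec):
        # L2-normalize so the dot product below is cosine similarity.
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    queries = [normalize(v) for v in query_embs]
    pages = [normalize(v) for v in page_embs]

    total = 0.0
    for i, query in enumerate(queries):
        # Temperature-scaled similarity of this query to every page in the batch.
        logits = [
            sum(a * b for a, b in zip(query, page)) / temperature
            for page in pages
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability of the matching page.
        total += log_denom - logits[i]
    return total / len(queries)


# Toy 2-D embeddings (illustrative values only).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_pages = [[0.9, 0.1], [0.1, 0.9]]
swapped_pages = [[0.1, 0.9], [0.9, 0.1]]

print(info_nce_loss(queries, aligned_pages))
print(info_nce_loss(queries, swapped_pages))
```

The loss is small when each query sits closest to its own page and grows when the pairing is swapped, which is the pressure that pulls matched queries and pages together in the shared embedding space.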
