VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning
Yucheng Shen, Jiulong Wu, Jizhou Huang, Dawei Yin, Lingyong Yan, Min Cao

TL;DR
VISOR is a unified agentic framework for visual retrieval-augmented generation that targets two problems, weak cross-page reasoning and search drift, using a structured Evidence Space and a dynamic, sliding-window search trajectory.
Contribution
The paper introduces VISOR, a novel single-agent system with mechanisms for structured reasoning and search drift mitigation, achieving state-of-the-art results in visual reasoning tasks.
Findings
VISOR outperforms existing methods on ViDoSeek, SlideVQA, and MMLongBench datasets.
The structured Evidence Space improves cross-page reasoning accuracy.
Dynamic Trajectory with Sliding Window reduces search drift and maintains context relevance.
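To make the last two findings concrete, here is a minimal sketch of how a sliding window over the search trajectory could sit alongside a persistent evidence store. This is an illustration under stated assumptions, not the authors' implementation: the names `Step`, `SlidingTrajectory`, `window`, and `context_pages`, and the choice to store evidence as short text notes keyed by page ID, are all hypothetical and do not come from the paper.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass
class Step:
    """One search step (hypothetical shape): the query issued and the pages retrieved."""
    query: str
    page_ids: list[str]


@dataclass
class SlidingTrajectory:
    """Keeps only the most recent `window` steps in context.

    Assumption: evidence notes persist across the whole search, while the
    raw pages of older steps are dropped to limit visual-token accumulation.
    """
    window: int = 3
    steps: deque = field(default_factory=deque)
    evidence: dict = field(default_factory=dict)  # page_id -> short note

    def add(self, step: Step, notes: dict[str, str]) -> None:
        # Record evidence first so it survives after the step leaves the window.
        self.evidence.update(notes)
        self.steps.append(step)
        while len(self.steps) > self.window:
            self.steps.popleft()

    def context_pages(self) -> list[str]:
        # Pages still visible to the model: those from in-window steps, deduplicated in order.
        seen: list[str] = []
        for step in self.steps:
            for page_id in step.page_ids:
                if page_id not in seen:
                    seen.append(page_id)
        return seen


traj = SlidingTrajectory(window=2)
traj.add(Step("revenue 2021", ["p1", "p2"]), {"p1": "2021 revenue table"})
traj.add(Step("revenue 2022", ["p2", "p5"]), {"p5": "2022 revenue chart"})
traj.add(Step("growth rate", ["p7"]), {"p7": "growth summary"})
pages_in_context = traj.context_pages()
```

In this sketch the first step's pages fall out of the window after the third step, but the note recorded for `p1` remains in `evidence`, which is one plausible way to keep cross-page evidence available without carrying every retrieved page forward.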
Abstract
Visual Retrieval-Augmented Generation (VRAG) empowers Vision-Language Models to retrieve and reason over visually rich documents. To tackle complex queries requiring multi-step reasoning, agentic VRAG systems interleave reasoning with iterative retrieval. However, existing agentic VRAG faces two critical bottlenecks. (1) Visual Evidence Sparsity: key evidence is scattered across pages yet processed in isolation, hindering cross-page reasoning; moreover, fine-grained intra-image evidence often requires precise visual actions, whose misuse degrades retrieval quality; (2) Search Drift in Long Horizons: the accumulation of visual tokens across retrieved pages dilutes context and causes cognitive overload, leading agents to deviate from their search objective. To address these challenges, we propose VISOR (Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon…
