Beyond Patch Aggregation: 3-Pass Pyramid Indexing for Vision-Enhanced Document Retrieval

Anup Roy; Rishabh Gyanendra Upadhyay; Animesh Rameshbhai Panara; Robin Mills; Aidan Millar

arXiv:2511.21121 · cs.IR · January 7, 2026

Open Access

TL;DR

This paper introduces VisionRAG, a multimodal document retrieval system that directly indexes images with a three-pass pyramid approach, preserving layout and spatial cues while being efficient and scalable.

Contribution

It proposes a novel OCR-free, model-agnostic retrieval framework using a pyramid indexing method that combines global summaries and visual cues for improved document retrieval.

Findings

01. Achieves high accuracy and recall on financial document benchmarks.

02. Stores significantly fewer vectors per page compared to patch-based methods.

03. Maintains flexibility and efficiency across different multimodal encoders.

Abstract

Document-centric RAG pipelines usually begin with OCR, followed by brittle heuristics for chunking, table parsing, and layout reconstruction. These text-first workflows are costly to maintain, sensitive to small layout shifts, and often lose the spatial cues that contain the answer. Vision-first retrieval has emerged as a strong alternative. By operating directly on page images, systems like ColPali and ColQwen preserve structure and reduce pipeline complexity while achieving strong benchmark performance. However, these late-interaction models tie retrieval to a specific vision backbone and require storing hundreds of patch embeddings per page, creating high memory overhead and complicating large-scale deployment. We introduce VisionRAG, a multimodal retrieval system that is OCR-free and model-agnostic. VisionRAG indexes documents directly as images, preserving layout, tables, and…
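To make the memory argument concrete, the following is a rough back-of-the-envelope sketch of raw embedding storage for a patch-based index versus an index that keeps only a handful of vectors per page. Every number here (corpus size, vectors per page, embedding dimensions, float32 storage) is an illustrative assumption rather than a figure from the paper, and the helper `index_bytes` is hypothetical.

```python
def index_bytes(pages: int, vectors_per_page: int, dim: int,
                bytes_per_value: int = 4) -> int:
    """Raw size of dense embeddings, ignoring index overhead and compression."""
    return pages * vectors_per_page * dim * bytes_per_value


# Assumed workload: all values below are placeholders, not results from the paper.
PAGES = 100_000

# Patch-based late interaction: hundreds of low-dimensional vectors per page.
patch_based = index_bytes(PAGES, vectors_per_page=500, dim=128)

# Compact index: a few higher-dimensional vectors per page.
compact = index_bytes(PAGES, vectors_per_page=10, dim=1024)

ratio = patch_based / compact

print(f"patch-based: {patch_based / 1e9:.1f} GB")
print(f"compact:     {compact / 1e9:.1f} GB")
print(f"ratio:       {ratio:.2f}x")
```

Under these assumed settings the patch-based index is several times larger even though each of its vectors is much smaller, because the number of vectors per page dominates the total. The real ratio depends on the encoder, quantization, and the number of vectors each method actually stores.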

Taxonomy

Topics: Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques