Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding

Vishesh Tripathi; Tanmay Odapally; Indraneel Das; Uday Allu; Biddwan Ahmed

arXiv:2506.16035·cs.LG·July 15, 2025

Open Access

TL;DR

This paper introduces a multimodal document chunking method using Large Multimodal Models to improve RAG systems' ability to understand complex PDF documents with multi-page tables, figures, and structural dependencies.

Contribution

The paper presents a novel vision-guided chunking approach that enhances RAG by preserving document structure and semantics across pages, outperforming traditional text-based methods.

Findings

01. Improved chunk quality and downstream RAG performance.

02. Better preservation of document structure and semantic coherence.

03. Superior accuracy over vanilla RAG systems.

Abstract

Retrieval-Augmented Generation (RAG) systems have revolutionized information retrieval and question answering, but traditional text-based chunking methods struggle with complex document structures, multi-page tables, embedded figures, and contextual dependencies across page boundaries. We present a novel multimodal document chunking approach that leverages Large Multimodal Models (LMMs) to process PDF documents in batches while maintaining semantic coherence and structural integrity. Our method processes documents in configurable page batches with cross-batch context preservation, enabling accurate handling of tables spanning multiple pages, embedded visual elements, and procedural content. We evaluate our approach on a curated dataset of PDF documents with manually crafted queries, demonstrating improvements in chunk quality and downstream RAG performance. Our vision-guided approach…
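
The batching idea in the abstract (configurable page batches with context carried across batch boundaries) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `chunk_batch` callable standing in for an LMM call, the use of plain strings as pages, and the choice to carry forward the last chunk of the previous batch as context are all assumptions made for illustration.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def page_batches(num_pages: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) page ranges covering the document in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, num_pages, batch_size):
        yield start, min(start + batch_size, num_pages)


def chunk_document(
    pages: Sequence[str],
    batch_size: int,
    chunk_batch: Callable[[Sequence[str], str], List[str]],
) -> List[str]:
    """Chunk a document batch by batch, passing context between batches.

    `chunk_batch` is a hypothetical stand-in for the multimodal model call:
    it receives the pages of one batch plus the carried context and returns
    the chunks it produced for that batch.
    """
    chunks: List[str] = []
    carried_context = ""  # empty for the first batch
    for start, end in page_batches(len(pages), batch_size):
        batch_chunks = chunk_batch(pages[start:end], carried_context)
        chunks.extend(batch_chunks)
        if batch_chunks:
            # Assumption: the last chunk of a batch is the context that lets
            # the next batch continue a table or procedure spanning pages.
            carried_context = batch_chunks[-1]
    return chunks
```

In practice `chunk_batch` would wrap a call to whichever multimodal model is in use. Keeping it as an injected callable means the batching and context-passing logic can be exercised with a simple stub before any model is wired in.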

Taxonomy

Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Topic Modeling