Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding
Vishesh Tripathi, Tanmay Odapally, Indraneel Das, Uday Allu, Biddwan Ahmed

TL;DR
This paper introduces a multimodal document chunking method using Large Multimodal Models to improve RAG systems' ability to understand complex PDF documents with multi-page tables, figures, and structural dependencies.
Contribution
The paper presents a novel vision-guided chunking approach that enhances RAG by preserving document structure and semantics across pages, outperforming traditional text-based methods.
Findings
Improved chunk quality and downstream RAG performance.
Better preservation of document structure and semantic coherence.
Superior accuracy over vanilla RAG systems.
Abstract
Retrieval-Augmented Generation (RAG) systems have revolutionized information retrieval and question answering, but traditional text-based chunking methods struggle with complex document structures, multi-page tables, embedded figures, and contextual dependencies across page boundaries. We present a novel multimodal document chunking approach that leverages Large Multimodal Models (LMMs) to process PDF documents in batches while maintaining semantic coherence and structural integrity. Our method processes documents in configurable page batches with cross-batch context preservation, enabling accurate handling of tables spanning multiple pages, embedded visual elements, and procedural content. We evaluate our approach on a curated dataset of PDF documents with manually crafted queries, demonstrating improvements in chunk quality and downstream RAG performance. Our vision-guided approach…
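The batch-wise processing with cross-batch context preservation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `page_batches`, `chunk_document`, and the `chunk_batch` callable (standing in for a call to a Large Multimodal Model) are hypothetical, and carrying forward only the last chunk of the previous batch is an assumed, simplified form of context preservation.

```python
from typing import Callable, Iterator, List, Optional, Sequence

# Hypothetical signature for the model call: it receives the pages of one
# batch plus optional context from the previous batch, and returns chunks.
ChunkFn = Callable[[List[str], Optional[str]], List[str]]


def page_batches(num_pages: int, batch_size: int) -> Iterator[range]:
    """Yield consecutive page-index ranges of at most `batch_size` pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, num_pages, batch_size):
        yield range(start, min(start + batch_size, num_pages))


def chunk_document(
    pages: Sequence[str], batch_size: int, chunk_batch: ChunkFn
) -> List[str]:
    """Chunk a document batch by batch, passing context across batches.

    Assumption: the last chunk of the previous batch is enough context for
    the model to continue a table or procedure that crosses a batch boundary.
    """
    chunks: List[str] = []
    carry: Optional[str] = None  # context handed to the next batch
    for batch in page_batches(len(pages), batch_size):
        new_chunks = chunk_batch([pages[i] for i in batch], carry)
        chunks.extend(new_chunks)
        if new_chunks:
            carry = new_chunks[-1]
    return chunks
```

In practice `chunk_batch` would wrap whatever multimodal model is used, and `pages` would hold page images or extracted page content rather than plain strings.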
Taxonomy
Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Topic Modeling
