Zero-Shot Document Understanding using Pseudo Table of Contents-Guided Retrieval-Augmented Generation
Hyeon Seong Jeong, Sangwoo Jo, Byeong Hyun Yoon, Yoonseok Heo, Haedong Jeong, Taehoon Kim

TL;DR
DocsRay is a training-free, multimodal document understanding system that uses pseudo-TOC generation and hierarchical retrieval to process complex documents with diverse elements, achieving 64.7% accuracy on MMLongBench-Doc while reducing query latency by 45%.
Contribution
Introduces DocsRay, a novel zero-shot document understanding framework combining pseudo-TOC generation with hierarchical retrieval, without requiring additional training or specialized models.
Findings
Reduced query latency by 45%
Achieved 64.7% accuracy on MMLongBench-Doc
Effectively processes multimodal documents with diverse elements
Abstract
Understanding complex multimodal documents remains challenging due to their structural inconsistencies and limited training data availability. We introduce DocsRay, a training-free document understanding system that integrates pseudo Table of Contents (TOC) generation with hierarchical Retrieval-Augmented Generation (RAG). Our approach leverages the native capabilities of multimodal Large Language Models (LLMs) to process documents containing diverse elements such as text, images, charts, and tables without requiring specialized models or additional training. DocsRay's framework combines three key techniques: (1) a semantic structuring module using prompt-based LLM interactions to generate a hierarchical pseudo-TOC, (2) zero-shot multimodal analysis that converts diverse document elements into unified, text-centric representations using the inherent…
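To make the idea of retrieval guided by a pseudo-TOC more concrete, here is a minimal sketch of a two-stage, coarse-to-fine lookup: sections are ranked first, and only the chunks inside the best-scoring sections are ranked afterwards. This is an illustration under stated assumptions, not the DocsRay implementation. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and the `toc` structure, the `hierarchical_retrieve` name, and its parameters are all hypothetical. In the sketch, non-text elements such as charts and tables are assumed to have already been converted to text descriptions.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hierarchical_retrieve(query, toc, top_sections=1, top_chunks=2):
    """Rank pseudo-TOC sections first, then rank chunks within the kept sections."""
    q = embed(query)

    # Coarse stage: score each section by its title plus all of its chunk text.
    ranked_sections = sorted(
        toc,
        key=lambda s: cosine(q, embed(s["title"] + " " + " ".join(s["chunks"]))),
        reverse=True,
    )[:top_sections]

    # Fine stage: score only the chunks that belong to the selected sections.
    candidates = [(s["title"], c) for s in ranked_sections for c in s["chunks"]]
    candidates.sort(key=lambda pair: cosine(q, embed(pair[1])), reverse=True)
    return candidates[:top_chunks]


# Hypothetical pseudo-TOC: each entry is a section title with its text chunks.
toc = [
    {
        "title": "Revenue overview",
        "chunks": [
            "Total revenue grew 12 percent year over year.",
            "Chart: quarterly revenue by region.",
        ],
    },
    {
        "title": "Safety procedures",
        "chunks": [
            "Wear protective gloves when handling solvents.",
            "Table: emergency contact numbers.",
        ],
    },
]

results = hierarchical_retrieve("How much did total revenue grow?", toc)
for title, chunk in results:
    print(f"[{title}] {chunk}")
```

Because the fine stage only scores chunks from the sections kept by the coarse stage, the number of chunk comparisons per query shrinks as the document grows. That is one plausible source of latency savings in a hierarchical design, although this sketch does not reproduce the reported figures.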
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Handwritten Text Recognition Techniques
