DocCogito: Aligning Layout Cognition and Step-Level Grounded Reasoning for Document Understanding
Yuchuan Wu, Minghan Zhuo, Teng Fu, Mengyang Zhao, Bin Li, Xiangyang Xue

TL;DR
DocCogito introduces a unified framework for document understanding that combines global layout perception with structured reasoning, significantly improving evidence-grounded answers in multimodal large language models.
Contribution
The paper presents a novel integrated approach with a layout tower and Visual-Semantic Chain for explicit, evidence-grounded reasoning in document understanding tasks.
Findings
Achieves state-of-the-art results on four benchmarks
Demonstrates strong generalization across six datasets
Enhances reasoning accuracy with evidence alignment mechanisms
Abstract
Document understanding with multimodal large language models (MLLMs) requires not only accurate answers but also explicit, evidence-grounded reasoning, especially in high-stakes scenarios. However, current document MLLMs still fall short of forming a complete, human-like reasoning process: even when they improve both layout encoding and CoT-style prompting, the interaction between the two is typically learned implicitly and remains loosely coupled rather than being enforced as a systematic mechanism. We therefore propose DocCogito, a unified framework that integrates global layout perception with structured, region-grounded reasoning. DocCogito introduces a lightweight layout tower that distills page structure into learnable global layout prior tokens, and a deterministic Visual-Semantic Chain (VSC), a concise structured representation less ambiguous than free-form natural-language…
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Text Readability and Simplification
