DocCogito: Aligning Layout Cognition and Step-Level Grounded Reasoning for Document Understanding

Yuchuan Wu; Minghan Zhuo; Teng Fu; Mengyang Zhao; Bin Li; Xiangyang Xue

arXiv:2603.07494·cs.CV·March 10, 2026

Open Access

TL;DR

DocCogito introduces a unified framework for document understanding that combines global layout perception with structured reasoning, significantly improving evidence-grounded answers in multimodal large language models.

Contribution

The paper presents a novel integrated approach with a layout tower and Visual-Semantic Chain for explicit, evidence-grounded reasoning in document understanding tasks.

Findings

01. Achieves state-of-the-art results on four benchmarks

02. Demonstrates strong generalization across six datasets

03. Enhances reasoning accuracy with evidence alignment mechanisms

Abstract

Document understanding with multimodal large language models (MLLMs) requires not only accurate answers but also explicit, evidence-grounded reasoning, especially in high-stakes scenarios. However, current document MLLMs still fall short of a complete, human-like reasoning process: even when they improve both layout encoding and CoT-style prompting, the interaction between the two is typically learned implicitly and remains loosely coupled rather than being enforced as a systematic mechanism. We therefore propose DocCogito, a unified framework that integrates global layout perception with structured, region-grounded reasoning. DocCogito introduces a lightweight layout tower that distills page structure into learnable global layout prior tokens, and a deterministic Visual-Semantic Chain (VSC), a concise structured representation less ambiguous than free-form natural-language…

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Text Readability and Simplification