DMAP: Human-Aligned Structural Document Map for Multimodal Document Understanding
ShunLiang Fu, Yanxin Zhang, Yixin Xiang, Xiaoyu Du, Jinhui Tang

TL;DR
This paper introduces DMAP, a hierarchical and relational document representation that improves multimodal question-answering by aligning with human understanding and enabling structure-aware reasoning.
Contribution
The paper presents a novel hierarchical document map (DMAP) that encodes structural and relational information, enhancing multimodal document understanding and reasoning.
Findings
DMAP improves retrieval precision and reasoning consistency.
DMAP outperforms conventional RAG-based approaches on MMDocQA benchmarks.
The approach aligns document representations with human interpretive patterns.
Abstract
Existing multimodal document question-answering (QA) systems predominantly rely on flat semantic retrieval, representing documents as a set of disconnected text chunks and largely neglecting their intrinsic hierarchical and relational structures. Such flattening disrupts the logical and spatial dependencies, such as section organization, figure-text correspondence, and cross-reference relations, that humans naturally exploit for comprehension. To address this limitation, we introduce a document-level structural Document MAP (DMAP), which explicitly encodes both hierarchical organization and inter-element relationships within multimodal documents. Specifically, we design a Structured-Semantic Understanding Agent to construct DMAP by organizing textual content together with figures, tables, charts, etc. into a human-aligned hierarchical schema that captures both semantic and layout…
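As an illustration only, a document map of the kind the abstract describes could be sketched in Python as a tree of elements plus a list of typed relations between them. The class names, field names, and the `refers_to` relation label below are hypothetical and are not taken from the paper; the actual DMAP schema and the agent that builds it may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One document element (hypothetical schema, not the paper's)."""
    node_id: str
    kind: str  # e.g. "document", "section", "paragraph", "figure", "table"
    text: str = ""
    children: List["Node"] = field(default_factory=list)


@dataclass
class DocumentMap:
    """A tree for hierarchy plus (source, relation, target) triples for cross-links."""
    root: Node
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def path_to(self, node_id: str) -> List[str]:
        """Return the ids from the root down to node_id, or [] if it is absent."""
        def walk(node: Node, trail: List[str]) -> List[str]:
            trail = trail + [node.node_id]
            if node.node_id == node_id:
                return trail
            for child in node.children:
                found = walk(child, trail)
                if found:
                    return found
            return []
        return walk(self.root, [])

    def related(self, node_id: str) -> List[Tuple[str, str]]:
        """Return (relation, target_id) pairs whose source is node_id."""
        return [(rel, dst) for src, rel, dst in self.relations if src == node_id]


# Toy document: one section containing a paragraph that refers to a figure.
fig = Node("fig1", "figure", "Revenue by quarter")
para = Node("p1", "paragraph", "As Figure 1 shows, revenue grew.")
sec = Node("sec2", "section", "Results", [para, fig])
doc = DocumentMap(
    root=Node("doc", "document", "Report", [sec]),
    relations=[("p1", "refers_to", "fig1")],
)

print(doc.path_to("fig1"))  # hierarchical context of the figure
print(doc.related("p1"))    # elements the paragraph points to
```

In contrast to a flat list of chunks, a retriever working over such a structure can return a matched paragraph together with its enclosing section and any figure it references, which is the kind of structure-aware lookup the abstract motivates.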
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Information Retrieval and Search Behavior
