$G^2$-Reader: Dual Evolving Graphs for Multimodal Document QA
Yaxin Du, Junru Song, Yifan Zhou, Cheng Wang, Jiahao Gu, Zimeng Chen, Menglan Chen, Wen Yao, Yang Yang, Ying Wen, Siheng Chen

TL;DR
$G^2$-Reader introduces a dual-graph system that preserves document structure and guides multimodal question answering, significantly improving accuracy over existing methods on diverse benchmarks.
Contribution
It proposes a novel dual-graph architecture combining content and planning graphs to enhance multimodal document QA performance.
Findings
Achieves 66.21% accuracy on VisDoMBench, surpassing baselines.
Outperforms GPT-5, which scores 53.08% accuracy.
Effectively handles long, complex multimodal documents.
Abstract
Retrieval-augmented generation is a practical paradigm for question answering over long documents, but it remains brittle for multimodal reading where text, tables, and figures are interleaved across many pages. First, flat chunking breaks document-native structure and cross-modal alignment, yielding semantic fragments that are hard to interpret in isolation. Second, even iterative retrieval can fail in long contexts by looping on partial evidence or drifting into irrelevant sections as noise accumulates, since each step is guided only by the current snippet without a persistent global search state. We introduce $G^2$-Reader, a dual-graph system, to address both issues. It evolves a Content Graph to preserve document-native structure and cross-modal semantics, and maintains a Planning Graph, an agentic directed acyclic graph of sub-questions, to track intermediate findings and guide…
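To make the idea of a planning graph concrete, here is a minimal Python sketch of a directed acyclic graph of sub-questions that are resolved in dependency order, with each node receiving the findings of its parents. This is an illustration only, not the authors' implementation: the names `SubQuestion`, `answer_in_order`, and `stub_answer` are hypothetical, and the stub stands in for whatever retrieval and model calls a real system would make.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Optional


@dataclass
class SubQuestion:
    """One node of a hypothetical planning DAG (names are assumptions)."""
    qid: str
    text: str
    depends_on: List[str] = field(default_factory=list)
    finding: Optional[str] = None  # filled in once the node is answered


def answer_in_order(
    nodes: List[SubQuestion],
    answer_fn: Callable[[SubQuestion, Dict[str, Optional[str]]], str],
) -> List[str]:
    """Answer sub-questions so that every node sees its parents' findings.

    Raises graphlib.CycleError if the dependencies are not acyclic.
    """
    by_id = {n.qid: n for n in nodes}
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {n.qid: set(n.depends_on) for n in nodes}
    order = list(TopologicalSorter(graph).static_order())
    for qid in order:
        node = by_id[qid]
        context = {p: by_id[p].finding for p in node.depends_on}
        node.finding = answer_fn(node, context)
    return order


def stub_answer(node: SubQuestion, context: Dict[str, Optional[str]]) -> str:
    # Placeholder for evidence retrieval plus a model call.
    return f"answer({node.qid}; uses {sorted(context)})"


plan = [
    SubQuestion("q3", "Compare the two values.", depends_on=["q1", "q2"]),
    SubQuestion("q1", "What does the table report?"),
    SubQuestion("q2", "What does the figure show?"),
]
order = answer_in_order(plan, stub_answer)
```

Because findings are stored on the nodes, the plan itself acts as a persistent search state: a later sub-question can consult what earlier ones established instead of relying only on the most recently retrieved snippet.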
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
