MHier-RAG: Multi-Modal RAG for Visual-Rich Document Question-Answering via Hierarchical and Multi-Granularity Reasoning

Ziyu Gong; Chengcheng Mai; Yihua Huang

arXiv:2508.00579 · cs.MM · October 6, 2025

Open Access

TL;DR

MHier-RAG is a novel multi-modal retrieval-augmented generation model that effectively integrates multi-page, multi-modal evidence for accurate question answering in visual-rich documents, addressing limitations of previous methods.

Contribution

The paper introduces MHier-RAG, a hierarchical multi-granularity retrieval and reasoning framework for multi-modal long-document question answering, combining hierarchical indexing and semantic re-ranking.

Findings

01 Outperforms existing methods on public datasets

02 Effectively connects multi-modal evidence across pages

03 Enhances understanding of visual-rich, multi-page documents

Abstract

The multi-modal long-context document question-answering task aims to locate and integrate multi-modal evidence (such as text, tables, charts, images, and layouts) distributed across multiple pages for question understanding and answer generation. Existing methods can be categorized as Large Vision-Language Model (LVLM)-based or Retrieval-Augmented Generation (RAG)-based. However, the former are susceptible to hallucinations, while the latter suffer from inter-modal disconnection and cross-page fragmentation. To address these challenges, a novel multi-modal RAG model named MHier-RAG is proposed, which leverages both textual and visual information across long-range pages to facilitate accurate question answering for visual-rich documents. A hierarchical indexing method that integrates flattened in-page chunks and topological cross-page chunks is designed to…
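The abstract names an index built from flattened in-page chunks and topological cross-page chunks, and the Contribution pairs that index with re-ranking. The Python sketch below is a rough illustration of that general idea only, not the authors' implementation. Every name in it (`Chunk`, `parent`, `overlap_score`, `retrieve`, `CHUNKS`) is hypothetical. Word-overlap scoring stands in for whatever semantic model the paper uses, and a shared `parent` label is an assumed way to link chunks that sit on different pages.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    chunk_id: str
    page: int
    modality: str                 # e.g. "text", "table", "chart"
    content: str
    parent: Optional[str] = None  # assumed cross-page link (e.g. a section node)


def overlap_score(query: str, chunk: Chunk) -> float:
    # Toy stand-in for semantic similarity: Jaccard overlap of word sets.
    q = set(query.lower().split())
    c = set(chunk.content.lower().split())
    union = q | c
    return len(q & c) / len(union) if union else 0.0


def retrieve(query: str, chunks: List[Chunk], top_k: int = 1) -> List[Chunk]:
    # Stage 1: flat retrieval over all in-page chunks.
    ranked = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)
    seeds = ranked[:top_k]

    # Stage 2: follow cross-page links by pulling in chunks that share
    # a parent with any seed, regardless of which page they are on.
    parents = {c.parent for c in seeds if c.parent is not None}
    expanded = {c.chunk_id: c for c in seeds}
    for c in chunks:
        if c.parent in parents:
            expanded[c.chunk_id] = c

    # Stage 3: re-rank the expanded set (ties broken by page order).
    return sorted(expanded.values(),
                  key=lambda c: (-overlap_score(query, c), c.page))


# Hypothetical three-chunk document spanning pages 2, 4, and 7.
CHUNKS = [
    Chunk("c1", 2, "text", "revenue grew in 2023", parent="sec_finance"),
    Chunk("c2", 7, "table", "table of quarterly revenue by region",
          parent="sec_finance"),
    Chunk("c3", 4, "text", "the team hired new engineers", parent="sec_people"),
]
```

With `top_k=1`, a query such as "how much did revenue grow in 2023" first selects the page-2 text chunk, and the cross-page link then brings in the page-7 table even though the table alone scores poorly. That is the kind of cross-page, cross-modal connection the abstract says flat RAG pipelines miss.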

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques