A Multi-Granularity Retrieval Framework for Visually-Rich Documents

Mingjun Xu; Zehui Wang; Hengxing Cai; Renxin Zhong

arXiv:2505.01457 · cs.IR · May 7, 2025

Open Access

TL;DR

This paper introduces a multi-granularity multimodal retrieval framework for visually-rich documents, combining hierarchical encoding, modality-aware retrieval, and vision-language models to improve accuracy without fine-tuning.

Contribution

It presents a unified retrieval framework that effectively handles complex visual and textual data in documents, using off-the-shelf models and hybrid strategies for robust performance.

Findings

01. Achieves a top score of 65.56 in retrieval accuracy.

02. Enhances retrieval with layout-aware search and VLM-based verification.

03. Operates effectively without task-specific fine-tuning.

Abstract

Retrieval-augmented generation (RAG) systems have predominantly focused on text-based retrieval, limiting their effectiveness in handling visually-rich documents that encompass text, images, tables, and charts. To bridge this gap, we propose a unified multi-granularity multimodal retrieval framework tailored for two benchmark tasks: MMDocIR and M2KR. Our approach integrates hierarchical encoding strategies, modality-aware retrieval mechanisms, and vision-language model (VLM)-based candidate filtering to effectively capture and utilize the complex interdependencies between textual and visual modalities. By leveraging off-the-shelf vision-language models and implementing a training-free hybrid retrieval strategy, our framework demonstrates robust performance without the need for task-specific fine-tuning. Experimental evaluations reveal that incorporating layout-aware search and VLM-based…
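
The abstract does not spell out how the granularities are combined, so the following is only a minimal sketch of a training-free hybrid retrieval step under stated assumptions: each candidate carries a coarse page-level score and a fine layout-level score, the two are fused by a weighted sum (an assumption, not the paper's confirmed formula), and a `verifier` callable stands in for the VLM-based candidate filter. The names `Candidate`, `fuse`, `retrieve`, and `alpha` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    doc_id: str
    page_score: float    # similarity from a page-level (coarse) encoder; assumed precomputed
    layout_score: float  # best similarity over layout regions (fine); assumed precomputed


def fuse(c: Candidate, alpha: float = 0.5) -> float:
    """Weighted sum of coarse and fine scores (illustrative fusion rule)."""
    return alpha * c.page_score + (1.0 - alpha) * c.layout_score


def retrieve(
    candidates: List[Candidate],
    verifier: Callable[[Candidate], bool],
    top_k: int = 3,
    alpha: float = 0.5,
) -> List[str]:
    """Rank by fused score, keep the top_k, then filter with the verifier."""
    ranked = sorted(candidates, key=lambda c: fuse(c, alpha), reverse=True)
    shortlist = ranked[:top_k]
    kept = [c for c in shortlist if verifier(c)]
    # Fall back to the unfiltered shortlist if the verifier rejects everything.
    return [c.doc_id for c in (kept or shortlist)]


pool = [
    Candidate("a", 0.9, 0.2),
    Candidate("b", 0.6, 0.8),
    Candidate("c", 0.1, 0.3),
    Candidate("d", 0.5, 0.5),
]
# Hypothetical verifier: in practice this would be a yes/no call to a VLM.
result = retrieve(pool, verifier=lambda c: c.doc_id != "a")
```

In this toy pool the verifier drops one shortlisted candidate without reordering the rest; the fallback keeps the pipeline from returning nothing when the filter is too strict.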

Taxonomy

Topics: Semantic Web and Ontologies · Web Data Mining and Analysis