SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding

Jian Chen; Ruiyi Zhang; Yufan Zhou; Tong Yu; Franck Dernoncourt; Jiuxiang Gu; Ryan A. Rossi; Changyou Chen; Tong Sun

arXiv:2411.01106·cs.CV·March 4, 2025

TL;DR

SV-RAG is a framework that improves the ability of multimodal large language models (MLLMs) to understand long, complex documents. It uses the MLLM itself, through two adapters, to retrieve evidence pages and then answer questions from them, and it reports state-of-the-art results on public benchmarks.

Contribution

The paper presents SV-RAG, a new method that leverages MLLMs as both retrievers and answerers, enabling efficient long document understanding without traditional parsers.

Findings

1. Achieves state-of-the-art performance on public benchmarks.
2. Demonstrates MLLMs can effectively retrieve relevant document pages.
3. Improves efficiency and accuracy in long document comprehension.

Abstract

Multimodal large language models (MLLMs) have recently shown great progress in text-rich image understanding, yet they still struggle with complex, multi-page, visually rich documents. Traditional methods that use document parsers for retrieval-augmented generation suffer from performance and efficiency limitations, while directly presenting all pages to MLLMs is inefficient, especially for lengthy documents. In this work, we present a novel framework named **S**elf-**V**isual **R**etrieval-**A**ugmented **G**eneration (SV-RAG), which can broaden the horizons of any MLLM to support long-document understanding. We demonstrate that **MLLMs themselves can serve as effective multimodal retrievers** that fetch relevant pages and then answer user questions based on these pages. SV-RAG is implemented with two specific MLLM adapters, one for evidence page retrieval and the other for question answering.…
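The retrieve-then-answer flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how pages are scored, so cosine similarity over one vector per page is an assumption, and `embed_question`, `embed_page`, and `generate` are hypothetical callables standing in for the retrieval adapter and the question-answering adapter.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_pages(question_vec: Vector, page_vecs: List[Vector], k: int) -> List[int]:
    """Return indices of the k pages most similar to the question.

    Assumption: similarity is plain cosine over a single vector per page.
    Ties are broken by page order.
    """
    scored = [(cosine(question_vec, vec), idx) for idx, vec in enumerate(page_vecs)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [idx for _, idx in scored[:k]]


def answer_question(
    question: str,
    pages: List[str],
    embed_question: Callable[[str], Vector],       # hypothetical: retrieval adapter
    embed_page: Callable[[str], Vector],           # hypothetical: retrieval adapter
    generate: Callable[[str, List[str]], str],     # hypothetical: QA adapter
    k: int = 2,
) -> Tuple[str, List[int]]:
    """Retrieve the top-k evidence pages, then answer using only those pages."""
    question_vec = embed_question(question)
    page_vecs = [embed_page(page) for page in pages]
    top = retrieve_pages(question_vec, page_vecs, k)
    # Pass evidence pages to the answerer in their original document order.
    evidence = [pages[i] for i in sorted(top)]
    return generate(question, evidence), top
```

The point of the two-stage design is that the answering step sees only `k` pages rather than the whole document, which is where the efficiency gain over feeding every page to the model would come from.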

Code & Models

Datasets

puar-playground/VisR-Bench
dataset · 25 dl

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling