SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding
Jian Chen, Ruiyi Zhang, Yufan Zhou, Tong Yu, Franck Dernoncourt, Jiuxiang Gu, Ryan A. Rossi, Changyou Chen, Tong Sun

TL;DR
SV-RAG is a framework that improves multimodal large language models' understanding of long, complex documents by using the MLLM itself, through two adapters, to retrieve evidence pages and then answer questions from them. The authors report state-of-the-art results on public benchmarks.
Contribution
The paper presents SV-RAG, a new method that leverages MLLMs as both retrievers and answerers, enabling efficient long document understanding without traditional parsers.
Findings
Achieves state-of-the-art performance on public benchmarks.
Demonstrates MLLMs can effectively retrieve relevant document pages.
Improves efficiency and accuracy in long document comprehension.
Abstract
Multimodal large language models (MLLMs) have recently shown great progress in text-rich image understanding, yet they still struggle with complex, multi-page visually-rich documents. Traditional methods using document parsers for retrieval-augmented generation suffer from performance and efficiency limitations, while directly presenting all pages to MLLMs leads to inefficiencies, especially with lengthy ones. In this work, we present a novel framework named **S**elf-**V**isual **R**etrieval-**A**ugmented **G**eneration (SV-RAG), which can broaden horizons of any MLLM to support long-document understanding. We demonstrate that **MLLMs themselves can be an effective multimodal retriever** to fetch relevant pages and then answer user questions based on these pages. SV-RAG is implemented with two specific MLLM adapters, one for evidence page retrieval and the other for question answering.…
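The retrieve-then-answer flow described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the page and query embeddings are taken as precomputed vectors, cosine similarity stands in for whatever scoring the retrieval adapter actually uses, and `answer_fn` is a hypothetical placeholder for the question-answering adapter.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query_vec: Sequence[float],
                   page_vecs: List[Sequence[float]],
                   k: int) -> List[int]:
    """Return the indices of the k pages most similar to the query, best first."""
    scores = [cosine(query_vec, page) for page in page_vecs]
    ranked = sorted(range(len(page_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def answer_with_retrieval(question: str,
                          query_vec: Sequence[float],
                          page_vecs: List[Sequence[float]],
                          pages: List[str],
                          answer_fn: Callable[[str, List[str]], str],
                          k: int = 2) -> str:
    """Step 1: pick evidence pages. Step 2: answer using only those pages.

    `answer_fn` is a hypothetical stand-in for the QA adapter.
    """
    evidence = [pages[i] for i in retrieve_top_k(query_vec, page_vecs, k)]
    return answer_fn(question, evidence)


# Toy data: three "pages" with made-up 2-D embeddings.
pages = ["cover page", "results table", "appendix"]
page_vecs = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
query_vec = [1.0, 0.0]

top_pages = retrieve_top_k(query_vec, page_vecs, k=2)
answer = answer_with_retrieval(
    "What are the results?", query_vec, page_vecs, pages,
    answer_fn=lambda q, ev: " | ".join(ev),  # placeholder: just echoes the evidence
)
```

The point of the sketch is the division of labor: only the top-k pages reach the answering step, so the cost of answering does not grow with the full document length.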
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
