VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
Manan Suri, Puneet Mathur, Franck Dernoncourt, Kanika Goswami, Ryan A. Rossi, Dinesh Manocha

TL;DR
This paper introduces VisDoMBench, a comprehensive benchmark for multi-document question answering with visually rich content, and proposes VisDoMRAG, a multimodal retrieval-augmented generation method that improves accuracy and consistency across modalities.
Contribution
The paper presents the first benchmark for multimodal multi-document QA and introduces a novel multimodal RAG approach with a consistency-constrained fusion mechanism.
Findings
VisDoMRAG outperforms unimodal and long-context LLM baselines by 12-20% on end-to-end multimodal document QA.
The benchmark effectively evaluates multimodal document QA systems.
The proposed method enhances answer accuracy and interpretability across visual and textual data.
Abstract
Understanding information from a collection of multiple documents, particularly those with visually rich elements, is important for document-grounded question answering. This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings with rich multimodal content, including tables, charts, and presentation slides. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG, combining robust visual retrieval capabilities with sophisticated linguistic reasoning. VisDoMRAG employs a multi-step reasoning process encompassing evidence curation and chain-of-thought reasoning for concurrent textual and visual RAG pipelines. A key novelty of VisDoMRAG is its consistency-constrained modality fusion mechanism, which aligns the reasoning processes across…
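The two-pipeline structure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `ModalityOutput`, `run_pipeline`, and `fuse`, and all `toy_*` functions, are hypothetical, and the retrievers, generators, and fusion step are stand-ins for the real retrieval models and LLM prompts. In particular, the toy fusion only compares final answers, whereas the abstract describes fusion as aligning the reasoning of the two pipelines.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ModalityOutput:
    """Result of one unimodal RAG pipeline (hypothetical structure)."""
    evidence: List[str]  # curated evidence: text chunks or page identifiers
    reasoning: str       # chain-of-thought produced from that evidence
    answer: str          # the pipeline's candidate answer


# Assumed interfaces: a retriever maps (question, k) to evidence items,
# a generator maps (question, evidence) to (reasoning, answer).
Retriever = Callable[[str, int], List[str]]
Generator = Callable[[str, List[str]], Tuple[str, str]]
Fuser = Callable[[str, ModalityOutput, ModalityOutput], str]


def run_pipeline(question: str, retrieve: Retriever,
                 generate: Generator, k: int = 3) -> ModalityOutput:
    """Retrieve evidence for one modality, then reason over it."""
    evidence = retrieve(question, k)
    reasoning, answer = generate(question, evidence)
    return ModalityOutput(evidence, reasoning, answer)


def fuse(question: str, text_out: ModalityOutput,
         visual_out: ModalityOutput, reconcile: Fuser) -> str:
    """Combine both pipelines' outputs into one final answer."""
    return reconcile(question, text_out, visual_out)


# Toy stand-ins so the sketch runs without any model or index.
def toy_text_retrieve(question: str, k: int) -> List[str]:
    return ["(toy text chunk) reported value: 40"][:k]


def toy_visual_retrieve(question: str, k: int) -> List[str]:
    return ["(toy page image) page_7"][:k]


def toy_text_generate(question: str, evidence: List[str]) -> Tuple[str, str]:
    return ("The text chunk states the value is 40.", "40")


def toy_visual_generate(question: str, evidence: List[str]) -> Tuple[str, str]:
    return ("The chart on the page shows a bar at 40.", "40")


def toy_reconcile(question: str, text_out: ModalityOutput,
                  visual_out: ModalityOutput) -> str:
    # Assumption: agree -> return the shared answer; disagree -> surface both.
    if text_out.answer == visual_out.answer:
        return text_out.answer
    return f"{text_out.answer} (text) / {visual_out.answer} (visual)"


question = "What value is reported?"
text_out = run_pipeline(question, toy_text_retrieve, toy_text_generate)
visual_out = run_pipeline(question, toy_visual_retrieve, toy_visual_generate)
final_answer = fuse(question, text_out, visual_out, toy_reconcile)
print(final_answer)
```

Keeping each pipeline's evidence and reasoning alongside its answer is what lets a fusion step inspect how the two modalities reached their conclusions instead of simply voting on the final strings.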
Taxonomy
Topics: Natural Language Processing Techniques · Web Data Mining and Analysis · Video Analysis and Summarization
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Byte Pair Encoding · Multi-Head Attention · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay
