VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation

Manan Suri; Puneet Mathur; Franck Dernoncourt; Kanika Goswami; Ryan A. Rossi; Dinesh Manocha

arXiv:2412.10704 · cs.CL · February 12, 2025

TL;DR

This paper introduces VisDoMBench, a comprehensive benchmark for multi-document question answering with visually rich content, and proposes VisDoMRAG, a multimodal retrieval-augmented generation method that improves accuracy and consistency across modalities.

Contribution

The paper presents the first benchmark for multimodal multi-document QA and introduces a novel multimodal RAG approach with a consistency-constrained fusion mechanism.

Findings

1. VisDoMRAG outperforms unimodal and long-context LLM baselines by 12-20%.
2. The benchmark effectively evaluates multimodal document QA systems.
3. The proposed method enhances answer accuracy and interpretability across visual and textual data.

Abstract

Understanding information from a collection of multiple documents, particularly those with visually rich elements, is important for document-grounded question answering. This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings with rich multimodal content, including tables, charts, and presentation slides. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG, combining robust visual retrieval capabilities with sophisticated linguistic reasoning. VisDoMRAG employs a multi-step reasoning process encompassing evidence curation and chain-of-thought reasoning for concurrent textual and visual RAG pipelines. A key novelty of VisDoMRAG is its consistency-constrained modality fusion mechanism, which aligns the reasoning processes across…
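
The abstract describes two parallel pipelines (textual and visual), each performing evidence curation and chain-of-thought reasoning, followed by a fusion step that checks the two for consistency. The sketch below only illustrates that control flow under stated assumptions; it is not the authors' implementation. The names `PipelineOutput`, `score`, `run_pipeline`, and `fuse` are hypothetical, retrieval is a toy word-overlap ranker, and the `generate` and `resolve` callables stand in for model calls whose prompts and details the abstract does not give.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class PipelineOutput:
    modality: str        # "text" or "visual" (labels are illustrative)
    evidence: List[str]  # curated evidence kept for the answer
    reasoning: str       # chain-of-thought trace returned by the generator
    answer: str          # candidate answer from this pipeline


def score(query: str, passage: str) -> int:
    """Toy relevance: count distinct passage words that also occur in the query."""
    query_words = set(query.lower().split())
    return sum(1 for w in set(passage.lower().split()) if w in query_words)


def run_pipeline(
    modality: str,
    query: str,
    corpus: List[str],
    generate: Callable[[str, List[str]], Tuple[str, str]],
    k: int = 2,
) -> PipelineOutput:
    """Retrieve the top-k passages, then ask `generate` for (reasoning, answer)."""
    ranked = sorted(corpus, key=lambda p: score(query, p), reverse=True)
    evidence = ranked[:k]  # stand-in for the evidence curation step
    reasoning, answer = generate(query, evidence)
    return PipelineOutput(modality, evidence, reasoning, answer)


def fuse(
    text_out: PipelineOutput,
    visual_out: PipelineOutput,
    resolve: Callable[[PipelineOutput, PipelineOutput], str],
) -> Dict[str, object]:
    """Assumed fusion rule: accept matching answers, otherwise defer to `resolve`."""
    same = text_out.answer.strip().lower() == visual_out.answer.strip().lower()
    if same:
        return {"answer": text_out.answer, "consistent": True}
    # Disagreement: `resolve` sees both reasoning traces and picks or rewrites.
    return {"answer": resolve(text_out, visual_out), "consistent": False}
```

In a real system, `generate` would wrap a language model for the textual pipeline and a vision-language model for the visual one, and `resolve` would be another model call that reconciles the two reasoning chains.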

Taxonomy

Topics: Natural Language Processing Techniques · Web Data Mining and Analysis · Video Analysis and Summarization

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Byte Pair Encoding · Multi-Head Attention · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay