MAVIS: A Benchmark for Multimodal Source Attribution in Long-form Visual Question Answering

Seokwon Song, Minsu Park, and Gunhee Kim

arXiv:2511.12142 · cs.CV · December 19, 2025


TL;DR

MAVIS is a new benchmark dataset for evaluating multimodal source attribution in long-form visual question answering, emphasizing the importance of references, evidence retrieval, and answer quality in multimodal AI systems.

Contribution

The paper introduces MAVIS, the first comprehensive benchmark for multimodal source attribution in visual QA, along with automatic metrics and analysis of multimodal large language models.

Findings

01

Multimodal RAG models produce more informative and fluent answers than unimodal models.

02

Groundedness is weaker for image documents compared to text documents.

03

Mitigating contextual bias in image interpretation is a key future research direction.

Abstract

Source attribution aims to enhance the reliability of AI-generated answers by including references for each statement, helping users validate the provided answers. However, existing work has primarily focused on text-only scenarios and largely overlooked the role of multimodality. We introduce MAVIS, the first benchmark designed to evaluate multimodal source attribution systems that understand user intent behind visual questions, retrieve multimodal evidence, and generate long-form answers with citations. Our dataset comprises 157K visual QA instances, where each answer is annotated with fact-level citations referring to multimodal documents. We develop fine-grained automatic metrics along three dimensions (informativeness, groundedness, and fluency) and demonstrate their strong correlation with human judgments. Our key findings are threefold: (1) LVLMs with multimodal RAG generate…
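To make "fact-level citations referring to multimodal documents" concrete, the sketch below models an answer as a list of facts, each carrying the set of document ids it cites, and compares predicted citations against gold ones. Everything in it is an assumption for illustration: the `CitedFact` class, the `citation_overlap` function, the `img_`/`txt_` id scheme, and the toy sentences are hypothetical, and the overlap score is a generic precision/recall, not the groundedness metric defined in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CitedFact:
    """One atomic statement of a long-form answer plus the documents it cites."""
    text: str
    doc_ids: frozenset  # hypothetical id scheme: "img_*" for images, "txt_*" for text


def citation_overlap(predicted, gold):
    """Micro-averaged precision and recall of cited document ids.

    Assumes predicted and gold facts are already aligned one-to-one;
    this is an illustrative score, not the paper's metric.
    """
    if len(predicted) != len(gold):
        raise ValueError("facts must be aligned one-to-one")
    hits = sum(len(p.doc_ids & g.doc_ids) for p, g in zip(predicted, gold))
    n_pred = sum(len(p.doc_ids) for p in predicted)
    n_gold = sum(len(g.doc_ids) for g in gold)
    precision = hits / n_pred if n_pred else 0.0
    recall = hits / n_gold if n_gold else 0.0
    return precision, recall


# Toy example with made-up facts and document ids.
gold = [
    CitedFact("The structure in the photo is a suspension bridge.", frozenset({"img_1"})),
    CitedFact("Its deck hangs from two main cables.", frozenset({"img_1", "txt_4"})),
]
predicted = [
    CitedFact("The structure in the photo is a suspension bridge.", frozenset({"img_1"})),
    CitedFact("Its deck hangs from two main cables.", frozenset({"img_1", "txt_9"})),
]

precision, recall = citation_overlap(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Keeping citations as sets per fact, rather than per answer, is what lets a score like this distinguish a correctly cited image from a correctly cited text passage within the same answer.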


Code & Models

Datasets

seokwon99/MAVIS
dataset · 103 downloads

Videos

MAVIS: A Benchmark for Multimodal Source Attribution in Long-form Visual Question Answering

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks