MAVIS: A Benchmark for Multimodal Source Attribution in Long-form Visual Question Answering
Seokwon Song, Minsu Park, and Gunhee Kim

TL;DR
MAVIS is a new benchmark dataset for evaluating multimodal source attribution in long-form visual question answering, emphasizing the importance of references, evidence retrieval, and answer quality in multimodal AI systems.
Contribution
The paper introduces MAVIS, the first comprehensive benchmark for multimodal source attribution in visual QA, along with automatic metrics and analysis of multimodal large language models.
Findings
Multimodal RAG models produce more informative and fluent answers than unimodal models.
Groundedness is weaker for image documents than for text documents.
Mitigating contextual bias in image interpretation is a key future research direction.
Abstract
Source attribution aims to enhance the reliability of AI-generated answers by including references for each statement, helping users validate the provided answers. However, existing work has primarily focused on text-only scenarios and has largely overlooked the role of multimodality. We introduce MAVIS, the first benchmark designed to evaluate multimodal source attribution systems that understand user intent behind visual questions, retrieve multimodal evidence, and generate long-form answers with citations. Our dataset comprises 157K visual QA instances, where each answer is annotated with fact-level citations referring to multimodal documents. We develop fine-grained automatic metrics along three dimensions (informativeness, groundedness, and fluency) and demonstrate their strong correlation with human judgments. Our key findings are threefold: (1) LVLMs with multimodal RAG generate…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks
