TL;DR
This paper introduces MuRGAt, a benchmark for evaluating fact-level attribution in multimodal reasoning tasks over complex inputs such as video and audio, and shows that current models tend to hallucinate citations.
Contribution
The paper presents MuRGAt, a new benchmark and evaluation framework for assessing fact-level attribution in multimodal models, addressing limitations of previous simplified benchmarks.
Findings
Strong models often hallucinate citations despite correct reasoning.
Increasing reasoning depth or structured grounding can reduce attribution accuracy.
Automatic evaluation correlates well with human judgments.
Abstract
Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on simplified, observation-based scenarios or limited modalities and fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt (Multimodal Reasoning with Grounded Attribution), a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation. Given inputs spanning video, audio, and other modalities, MuRGAt requires models to generate answers with explicit reasoning and precise citations, where each citation specifies both modality and temporal segments. To enable reliable…
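To make the citation format concrete, here is a minimal sketch of a citation that carries both a modality and a temporal segment, as the abstract describes. Everything in it is an illustrative assumption rather than the benchmark's actual schema or metric: the names `Citation`, `Claim`, `overlaps`, and `is_supported` are hypothetical, times are assumed to be in seconds, and the support check is a toy overlap test.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation: a modality plus a time span in seconds."""
    modality: str   # assumed labels, e.g. "video" or "audio"
    start_s: float
    end_s: float

    def __post_init__(self) -> None:
        if self.end_s < self.start_s:
            raise ValueError("end_s must not precede start_s")


@dataclass
class Claim:
    """Hypothetical unit of evaluation: one factual claim and its citations."""
    text: str
    citations: list[Citation]


def overlaps(a: Citation, b: Citation) -> bool:
    """True when both citations share a modality and their time spans intersect."""
    return a.modality == b.modality and a.start_s < b.end_s and b.start_s < a.end_s


def is_supported(claim: Claim, evidence: list[Citation]) -> bool:
    """Toy check (not the paper's metric): the claim must cite something,
    and every citation must overlap at least one evidence segment."""
    return bool(claim.citations) and all(
        any(overlaps(c, e) for e in evidence) for c in claim.citations
    )
```

For example, under these assumptions a claim citing `Citation("video", 15.0, 25.0)` counts as supported by an evidence segment `Citation("video", 10.0, 20.0)`, while the same time span cited as `"audio"` does not, which illustrates why a citation needs to name the modality as well as the time range.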
