Auditing the Reliability of Multimodal Generative Search
Erfan Samieyan Sahneh, Luca Maria Aiello

TL;DR
This paper audits the Gemini 2.5 Pro multimodal generative search system and finds that between 3.7% and 18.7% of its video-grounded claims, depending on the judge's strictness, are not supported by the videos they cite.
Contribution
It provides the first large-scale analysis of video-grounded claims in multimodal search, identifying common failure modes and factors linked to unsupported claims.
Findings
Between 3.7% and 18.7% of video-grounded claims are unsupported by their cited videos, depending on the judge's strictness
Unsupported claims often involve unverifiable details and overstated assertions
Claims with low semantic similarity to videos are more likely unsupported
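The similarity finding can be illustrated with a minimal sketch. It assumes each claim and each video (for example, its transcript) has already been turned into an embedding vector by some model; the function names, the toy vectors, and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def flag_low_similarity(claim_vec, video_vec, threshold=0.5):
    """Return True when a claim looks weakly related to its cited video.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return cosine_similarity(claim_vec, video_vec) < threshold

# Toy embeddings (hypothetical values).
claim = [1.0, 0.0]
related_video = [1.0, 0.0]
unrelated_video = [0.0, 1.0]

print(flag_low_similarity(claim, related_video))    # identical direction
print(flag_low_similarity(claim, unrelated_video))  # orthogonal direction
```

A flag of this kind would only prioritize claims for closer checking; low similarity is reported as a correlate of unsupported claims, not as proof.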
Abstract
Multimodal Large Language Models (MLLMs) increasingly function as generative search systems that retrieve and synthesize answers from multimedia content, including YouTube videos. Although these systems project authority by citing specific videos as evidence, the extent to which these citations genuinely substantiate the generated claims remains unexamined. We present a large-scale audit of the Gemini 2.5 Pro multimodal search system, analyzing 11,943 claim-video pairs generated across Medical, Economic, and General domains. Through automated verification using three independent LLM judges (87.7% inter-rater agreement), validated against human annotations, we find that depending on the judge's strictness, between 3.7% and 18.7% of video-grounded claims are not supported by their cited sources. The dominant failure modes are not outright contradictions but rather unverifiable…
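To make the judge-based numbers concrete, here is a minimal sketch of how per-judge unsupported rates and agreement between three judges could be computed. The abstract does not say how the 87.7% agreement figure is defined, so two common definitions are shown as assumptions; the verdict lists are toy data, and all function names are hypothetical.

```python
from itertools import combinations

def unsupported_rate(verdicts):
    """Fraction of claims one judge marks as unsupported (False)."""
    return sum(1 for v in verdicts if not v) / len(verdicts)

def unanimous_agreement(judges):
    """Fraction of claims on which every judge gives the same verdict."""
    rows = list(zip(*judges))
    return sum(1 for row in rows if len(set(row)) == 1) / len(rows)

def mean_pairwise_agreement(judges):
    """Average, over judge pairs, of the fraction of matching verdicts."""
    pairs = list(combinations(judges, 2))
    total = 0.0
    for a, b in pairs:
        total += sum(1 for x, y in zip(a, b) if x == y) / len(a)
    return total / len(pairs)

# Toy verdicts for five claims: True = supported, False = unsupported.
lenient = [True, True, True, True, False]
moderate = [True, True, False, True, False]
strict = [True, False, False, True, False]
judges = [lenient, moderate, strict]

rates = [unsupported_rate(j) for j in judges]
print("unsupported rate range:", min(rates), "to", max(rates))
print("unanimous agreement:", unanimous_agreement(judges))
print("mean pairwise agreement:", mean_pairwise_agreement(judges))
```

Reporting the minimum and maximum per-judge rate mirrors how a range such as 3.7% to 18.7% can arise from judges of differing strictness evaluating the same claim-video pairs.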
