MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments

Han Wang; David Wan; Hyunji Lee; Thinh Pham; Mikaela Cankosyan; Weiyuan Chen; Elias Stengel-Eskin; Tu Vu; Mohit Bansal

arXiv:2604.13418·cs.CL·April 16, 2026

TL;DR

MERRIN is a human-annotated benchmark that evaluates how well search-augmented AI agents retrieve and reason over multimodal, noisy web evidence, given natural language queries that carry no explicit modality cues.

Contribution

MERRIN emphasizes multimodal evidence retrieval and multi-hop reasoning over noisy, conflicting web sources, filling gaps left by prior datasets.

Findings

01. Average accuracy across agents is only 22.3%.

02. Strongest agents reach just 40.1% accuracy.

03. Agents tend to overexplore and struggle with conflicting information.

Abstract

Motivated by the underspecified, multi-hop nature of search queries and the multimodal, heterogeneous, and often conflicting nature of real-world web results, we introduce MERRIN (Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments), a human-annotated benchmark for evaluating search-augmented agents. MERRIN measures AI agents' ability to identify relevant modalities, retrieve multimodal evidence, and perform multi-hop reasoning over noisy web sources. It differs from prior work in three important aspects: (1) using natural language queries without explicit modality cues, (2) incorporating underexplored modalities such as video and audio, and (3) requiring the retrieval of complex, often noisy or conflicting multimodal evidence during web search. We evaluate diverse search agents powered by ten models, including strong closed-source models (e.g., GPT-5.4-mini, Gemini…

Code & Models

Repositories

hannight/MERRIN
github

Datasets

HanNight/MERRIN
dataset · 137 downloads
