TL;DR
InterLV-Search is a benchmark for interleaved multimodal agentic search, in which visual evidence is sought and used throughout the search trajectory rather than confined to the input or the final answer. Current models score below 50% accuracy on it.
Contribution
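The paper presents a 2,061-example benchmark with three levels (active visual evidence seeking, controlled offline interleaved multimodal search, and open-web interleaved multimodal search), along with a standardized agent for evaluating interleaved language-vision search tasks.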
Findings
Current models achieve below 50% accuracy on the benchmark.
Visual evidence seeking and multimodal evidence integration remain challenging.
Levels 1 and 2 are built with automated pipelines; Level 3 uses a machine-led, human-supervised open-web pipeline.
Abstract
Existing benchmarks for multimodal agentic search evaluate multimodal search and visual browsing, but visual evidence is either confined to the input or treated as an answer endpoint rather than part of an interleaved search trajectory. We introduce InterLV-Search, a benchmark for Interleaved Language-Vision Agentic Search, in which textual and visual evidence is repeatedly used to condition later search. It contains 2,061 examples across three levels: active visual evidence seeking, controlled offline interleaved multimodal search, and open-web interleaved multimodal search. Beyond existing benchmarks, it also includes multimodal multi-branch samples that involve comparison between multiple entities during the evidence search. We construct Level 1 and Level 2 with automated pipelines and Level 3 with a machine-led, human-supervised open-web pipeline. We further provide…
