# MMSearch-Plus: Benchmarking Provenance-Aware Search for Multimodal Browsing Agents

**Authors:** Xijia Tao, Yihua Teng, Xinxing Su, Xinyu Fu, Jihao Wu, Chaofan Tao, Ziru Liu, Haoli Bai, Rui Liu, Lingpeng Kong

arXiv: 2508.21475 · 2026-03-20

## TL;DR

MMSearch-Plus is a 311-task benchmark for multimodal search agents whose questions require extracting fine-grained visual cues and verifying them through iterative image-text retrieval. The strongest evaluated system reaches 36.0% end-to-end accuracy, and a provenance-aware set-of-mark (SoM) module adds up to 3.9 points.

## Contribution

The paper introduces MMSearch-Plus, a 311-task benchmark that enforces multimodal reasoning, together with a model-agnostic agent framework that combines standard browsing tools with a provenance-aware set-of-mark (SoM) module.

## Key findings

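- The strongest system achieves 36.0% end-to-end accuracy
- Integrating the set-of-mark (SoM) module yields consistent gains in multiple settings, with improvements of up to 3.9 points
- Failure analysis shows recurring errors in locating relevant webpages and in distinguishing visually similar events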

## Abstract

Existing multimodal browsing benchmarks often fail to require genuine multimodal reasoning, as many tasks can be solved with text-only heuristics without vision-in-the-loop verification. We introduce MMSearch-Plus, a 311-task benchmark that enforces multimodal understanding by requiring extraction and propagation of fine-grained visual cues through iterative image-text retrieval and cross-validation under retrieval noise. Our curation procedure seeds questions whose answers require extrapolating from spatial cues and temporal traces to out-of-image facts such as events, dates, and venues. Beyond the dataset, we provide a model-agnostic agent framework with standard browsing tools and a set-of-mark (SoM) module, which lets the agent place marks, crop subregions, and launch targeted image/text searches. SoM enables provenance-aware zoom-and-retrieve and improves robustness in multi-step reasoning. We evaluated closed- and open-source MLLMs in this framework. The strongest system achieves an end-to-end accuracy of 36.0%, and integrating SoM produces consistent gains in multiple settings, with improvements up to +3.9 points. From failure analysis, we observe recurring errors in locating relevant webpages and distinguishing between visually similar events. These results underscore the challenges of real-world multimodal search and establish MMSearch-Plus as a rigorous benchmark for advancing agentic MLLMs.
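To make the zoom-and-retrieve idea concrete, here is a minimal sketch of how a marked subregion might be cropped while keeping a record of where it came from. It is an illustration only and not the paper's implementation: the names `Mark`, `Crop`, `crop_mark`, and `build_image_search_request`, the toy list-of-lists image, and the request format are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Toy grayscale image: a list of rows, each a list of pixel values.
# (Assumption for illustration; a real agent would use an image library.)
Image = List[List[int]]


@dataclass(frozen=True)
class Mark:
    """A hypothetical set-of-mark entry: an id plus a bounding box."""
    mark_id: int
    box: Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


@dataclass(frozen=True)
class Crop:
    """Cropped pixels together with their provenance."""
    pixels: Image
    source_image: str  # identifier of the image the crop was taken from
    mark: Mark         # the mark (and box) that produced the crop


def crop_mark(image: Image, source_image: str, mark: Mark) -> Crop:
    """Crop the region under a mark and remember which image and box it came from."""
    left, top, right, bottom = mark.box
    height, width = len(image), len(image[0])
    if not (0 <= left < right <= width and 0 <= top < bottom <= height):
        raise ValueError(f"box {mark.box} is outside a {width}x{height} image")
    pixels = [row[left:right] for row in image[top:bottom]]
    return Crop(pixels=pixels, source_image=source_image, mark=mark)


def build_image_search_request(crop: Crop) -> dict:
    """Describe a targeted image search; the dict layout is hypothetical."""
    return {
        "tool": "image_search",
        "query_size": (len(crop.pixels[0]), len(crop.pixels)),  # (width, height)
        "provenance": {
            "source_image": crop.source_image,
            "mark_id": crop.mark.mark_id,
            "box": crop.mark.box,
        },
    }


# Usage: mark the central 2x2 region of a 4x4 image and prepare a search.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
crop = crop_mark(image, "query.png", Mark(mark_id=1, box=(1, 1, 3, 3)))
request = build_image_search_request(crop)
```

Carrying the source identifier and box alongside the pixels means any later retrieval result can be traced back to the exact region that triggered it, which is one plausible reading of "provenance-aware" in this setting.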

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21475/full.md

## Figures

30 figures with captions in the complete paper: https://tomesphere.com/paper/2508.21475/full.md

## References

52 references — full list in the complete paper: https://tomesphere.com/paper/2508.21475/full.md

---
Source: https://tomesphere.com/paper/2508.21475