TL;DR
GeoBrowse introduces a geolocation benchmark combining visual reasoning and multi-hop knowledge verification, supported by an agentic workflow and expert-annotated reasoning traces, to evaluate complex multi-step evidence integration.
Contribution
It presents a new geolocation benchmark with multi-level difficulty, a comprehensive evaluation workflow, and expert-annotated reasoning traces, advancing research in knowledge-intensive multi-step reasoning tasks.
Findings
GATE, the paper's agentic workflow, outperforms direct inference and open-source agents.
Coherent, level-specific tool-use plans improve evidence verification.
No-tool, search-only, or image-only setups are insufficient for accurate geolocation.
Abstract
Deep research agents integrate fragmented evidence through multi-step tool use. BrowseComp offers a text-only testbed for such agents, but existing multimodal benchmarks rarely require both the composition of weak visual cues and BrowseComp-style multi-hop verification. Geolocation is a natural testbed because answers depend on combining multiple ambiguous visual cues and validating them against open-web evidence. We therefore introduce GeoBrowse, a geolocation benchmark that combines visual reasoning with knowledge-intensive multi-hop queries. Level 1 tests extracting and composing fragmented visual cues, and Level 2 increases query difficulty by injecting long-tail knowledge and obfuscating key entities. To support evaluation, we provide an agentic workflow, GATE, with five think-with-image tools and four knowledge-intensive tools, and release expert-annotated stepwise traces grounded in verifiable…
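The split between think-with-image tools and knowledge-intensive tools, together with the restricted setups named under Findings, can be pictured as a small tool registry. The following is a minimal sketch, not the authors' implementation: the abstract gives only the tool counts (five and four), so the tool names, the `Tool` class, the `SETUPS` mapping, and the `tools_for` helper are all hypothetical, and treating "search-only" as "knowledge tools only" is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Tool:
    name: str
    family: str  # "image" (think-with-image) or "knowledge" (knowledge-intensive)


# Placeholder names: the source states only that there are five image tools
# and four knowledge tools, not what they are called.
TOOLS: List[Tool] = (
    [Tool(f"image_tool_{i}", "image") for i in range(1, 6)]
    + [Tool(f"knowledge_tool_{i}", "knowledge") for i in range(1, 5)]
)

# Assumed mapping from evaluation setup to the tool families it may use.
# "search-only" -> knowledge tools only is a guess, not confirmed by the source.
SETUPS: Dict[str, FrozenSet[str]] = {
    "no-tool": frozenset(),
    "image-only": frozenset({"image"}),
    "search-only": frozenset({"knowledge"}),
    "full": frozenset({"image", "knowledge"}),
}


def tools_for(setup: str) -> List[str]:
    """Return the names of the tools an agent may call under the given setup."""
    allowed = SETUPS[setup]
    return [tool.name for tool in TOOLS if tool.family in allowed]
```

Under these assumptions, `tools_for("full")` exposes all nine tools, while each restricted setup exposes only one family or none, which is the kind of contrast the Findings describe as insufficient for accurate geolocation.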
