GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces

Xinyu Geng; Yanjing Xiao; Yuyang Zhang; Hanwen Wang; Xinyan Liu; Rui Min; Tianqing Fang; Yi R. Fung

arXiv:2604.04017 · cs.CL · April 7, 2026

TL;DR

GeoBrowse introduces a geolocation benchmark combining visual reasoning and multi-hop knowledge verification, supported by an agentic workflow and expert-annotated reasoning traces, to evaluate complex multi-step evidence integration.

Contribution

It presents a new geolocation benchmark with multi-level difficulty, a comprehensive evaluation workflow, and expert-annotated reasoning traces, advancing research in knowledge-intensive multi-step reasoning tasks.

Findings

01

GATE, the paper's agentic workflow, outperforms direct inference and open-source agents.

02

Coherent, level-specific tool-use plans improve evidence verification.

03

No-tool, search-only, or image-only setups are insufficient for accurate geolocation.
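
The setups contrasted in the third finding can be pictured as subsets of one tool registry. The sketch below is illustrative only and is not taken from the GeoBrowse repository. The tool names are hypothetical placeholders, since the abstract gives only the counts (five think-with-image tools, four knowledge-intensive tools). Mapping "search-only" to the knowledge-intensive tools is likewise an assumption.

```python
from dataclasses import dataclass

# Hypothetical placeholder names: the source states only how many tools
# exist in each family, not what they are called.
IMAGE_TOOLS = tuple(f"image_tool_{i}" for i in range(1, 6))          # 5 tools
KNOWLEDGE_TOOLS = tuple(f"knowledge_tool_{i}" for i in range(1, 5))  # 4 tools


@dataclass(frozen=True)
class Setup:
    """One evaluation setup: a name plus the tools the agent may call."""
    name: str
    tools: tuple


def build_setups():
    """Return the full toolset and the three restricted setups."""
    return {
        "no_tool": Setup("no_tool", ()),
        "image_only": Setup("image_only", IMAGE_TOOLS),
        # Assumption: "search-only" means only knowledge-intensive tools.
        "search_only": Setup("search_only", KNOWLEDGE_TOOLS),
        "full": Setup("full", IMAGE_TOOLS + KNOWLEDGE_TOOLS),
    }


if __name__ == "__main__":
    for setup in build_setups().values():
        print(f"{setup.name}: {len(setup.tools)} tools")
```

Framed this way, each restricted setup removes one whole family of evidence, which matches the finding that neither family alone is enough.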

Abstract

Deep research agents integrate fragmented evidence through multi-step tool use. BrowseComp offers a text-only testbed for such agents, but existing multimodal benchmarks rarely require both the composition of weak visual cues and BrowseComp-style multi-hop verification. Geolocation is a natural testbed because answers depend on combining multiple ambiguous visual cues and validating them against open-web evidence. We therefore introduce GeoBrowse, a geolocation benchmark that combines visual reasoning with knowledge-intensive multi-hop queries. Level 1 tests the extraction and composition of fragmented visual cues, and Level 2 increases query difficulty by injecting long-tail knowledge and obfuscating key entities. To support evaluation, we provide an agentic workflow, GATE, with five think-with-image tools and four knowledge-intensive tools, and release expert-annotated stepwise traces grounded in verifiable…

Code & Models

Repositories

ornamentt/GeoBrowse (GitHub)
