ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks

Akashah Shabbir; Muhammad Akhtar Munir; Akshay Dudhane; Muhammad Umer Sheikh; Muhammad Haris Khan; Paolo Fraccaro; Juan Bernabe Moreno; Fahad Shahbaz Khan; Salman Khan

arXiv:2505.23752 · cs.CV · April 3, 2026

TL;DR

ThinkGeo is a new benchmark for evaluating large language model-driven agents on complex remote sensing tasks involving structured tool use and multi-step reasoning with satellite and aerial imagery.

Contribution

ThinkGeo introduces a remote sensing benchmark built from human-curated, real-world queries, and implements a ReAct-style interaction loop to assess LLMs' spatial reasoning and tool-use capabilities.
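To make the ReAct-style interaction loop concrete, here is a minimal sketch of the general pattern: the agent alternates between a thought, a tool call, and an observation until it emits a final answer. This is not ThinkGeo's implementation. The tool names (`count_objects`, `calculator`), their canned outputs, and the `scripted_policy` function that stands in for an LLM call are all hypothetical and exist only for illustration.

```python
import operator
from typing import Callable, Dict, List, Optional, Tuple

# (thought, action, action_input) proposed by the policy at each step.
Step = Tuple[str, str, str]
# (thought, action, action_input, observation) recorded after each step.
TraceEntry = Tuple[str, str, str, str]


def count_objects(label: str) -> str:
    """Hypothetical perception tool: returns a canned count for a label."""
    canned_counts = {"airplane": 3, "car": 12}
    return str(canned_counts.get(label, 0))


def calculator(expression: str) -> str:
    """Hypothetical arithmetic tool: evaluates 'a op b' for integer operands."""
    ops = {"+": operator.add, "-": operator.sub, "*": operator.mul}
    left, op, right = expression.split()
    return str(ops[op](int(left), int(right)))


TOOLS: Dict[str, Callable[[str], str]] = {
    "count_objects": count_objects,
    "calculator": calculator,
}


def scripted_policy(query: str, observations: List[str]) -> Step:
    """Stand-in for an LLM call (assumption): picks the next step by position."""
    if not observations:
        return ("I need the number of airplanes.", "count_objects", "airplane")
    if len(observations) == 1:
        return ("Each airplane needs 2 crew.", "calculator", f"{observations[0]} * 2")
    return ("I have the answer.", "finish", observations[-1])


def react_loop(
    query: str,
    policy: Callable[[str, List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Tuple[Optional[str], List[TraceEntry]]:
    """Run thought -> action -> observation steps until 'finish' or the step cap."""
    observations: List[str] = []
    trace: List[TraceEntry] = []
    for _ in range(max_steps):
        thought, action, action_input = policy(query, observations)
        if action == "finish":
            trace.append((thought, action, action_input, ""))
            return action_input, trace
        if action in tools:
            observation = tools[action](action_input)
        else:
            # Unknown tools are reported back so the policy can recover.
            observation = f"error: unknown tool {action!r}"
        observations.append(observation)
        trace.append((thought, action, action_input, observation))
    return None, trace  # step budget exhausted without a final answer


if __name__ == "__main__":
    answer, trace = react_loop("How many crew for all airplanes?", scripted_policy, TOOLS)
    for entry in trace:
        print(entry)
    print("answer:", answer)
```

Recording each step in a trace is what makes step-level evaluation possible: the sequence of tool calls can be compared against a reference sequence separately from the final answer.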

Findings

01 Tool accuracy differs significantly across models.

02 The evaluation covers 486 structured tasks with 1,778 reasoning steps.

03 The benchmark reveals strengths and weaknesses of LLMs in remote sensing.

Abstract

Recent progress in large language models (LLMs) has enabled tool-augmented agents capable of solving complex real-world tasks through step-by-step reasoning. However, existing evaluations often focus on general-purpose or multimodal scenarios, leaving a gap in domain-specific benchmarks that assess tool-use capabilities in complex remote sensing use cases. We present ThinkGeo, an agentic benchmark designed to evaluate LLM-driven agents on remote sensing tasks via structured tool use and multi-step planning. Inspired by tool-interaction paradigms, ThinkGeo includes human-curated queries spanning a wide range of real-world applications such as urban planning, disaster assessment and change analysis, environmental monitoring, transportation analysis, aviation monitoring, recreational infrastructure, and industrial site analysis. Queries are grounded in satellite or aerial imagery,…

Code & Models

Datasets

ganghyunnnn/rs-taxonomy-labels
dataset · 243 downloads
