EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents
Sai Ma, Zhuang Li, Sichao Li, Xinyue Xu, Ruibiao Zhu, Tony Boston, John A. Taylor

TL;DR
EO-Gym introduces a comprehensive, interactive environment for Earth Observation analysis, enabling multimodal, tool-using agents to perform complex reasoning across spatial, temporal, and sensor modalities.
Contribution
EO-Gym provides a controlled framework and benchmark for interactive EO reasoning, addressing the limitations of existing fixed-input, single-turn tasks.
Findings
Strong models struggle with interactive EO reasoning, especially across time and modalities.
Fine-tuned EO-Gym-4B improves Pass@3 from 0.49 to 0.74.
EO-Gym offers a reproducible environment for developing and evaluating EO agents.
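The summary above reports Pass@3 but does not say how it is computed. As a point of reference only, the sketch below shows the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n attempts of which c succeed. That EO-Gym uses this exact estimator is an assumption, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k sampled attempts succeeds.

    n: total attempts sampled for a task
    c: number of those attempts that succeeded
    k: attempt budget being scored (k <= n)

    Assumption: this is the common unbiased estimator, not necessarily
    the scoring rule used by EO-Gym.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failures exist, so every size-k draw contains a success.
    if n - c < k:
        return 1.0
    # 1 minus the fraction of size-k draws that contain only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over tasks, given (n, c) pairs per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With exactly three attempts per task, `pass_at_k(3, c, 3)` reduces to 1.0 if any attempt succeeded and 0.0 otherwise, so the benchmark-level score is simply the fraction of tasks solved within three tries.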
Abstract
Earth Observation (EO) analysis is inherently interactive: resolving uncertainty often requires expanding the region of interest, retrieving historical observations, and switching across sensors such as optical and Synthetic Aperture Radar. However, most EO benchmarks collapse this process into fixed-input, single-turn tasks. To address this gap, we present EO-Gym, a controlled executable framework for multimodal, tool-using EO agents that formulates EO analysis as a Gymnasium-style local geospatial workspace backed by more than 660k multimodal files indexed by location, time, and sensor type, with 35 EO-specialized tools spanning six task families. Built on this environment, we construct EO-Gym-Data, a benchmark of 9,078 trajectories and 34,604 reasoning steps, grounded in eight public EO datasets together with Landsat and Sentinel-2 imagery. Evaluating open and closed VLMs…
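To make the "Gymnasium-style workspace" idea concrete, here is a minimal sketch of a reset/step loop over a file index keyed by location, time, and sensor. It uses only the standard library and mirrors the shape of Gymnasium's `reset()` and `step()` return values. Every name in it (`FileRecord`, `ToyEOWorkspace`, the `search` and `answer` tools, the `tile` key) is hypothetical and is not EO-Gym's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """One indexed file (hypothetical schema)."""
    path: str
    tile: str    # coarse location key
    date: str    # ISO acquisition date
    sensor: str  # e.g. "optical" or "sar"


class ToyEOWorkspace:
    """Toy environment with a Gymnasium-like reset/step interface."""

    def __init__(self, records, max_steps=5):
        self.records = list(records)
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        # Returns (observation, info), as Gymnasium environments do.
        self.steps = 0
        observation = {"message": "workspace ready", "files": []}
        return observation, {"num_files": len(self.records)}

    def step(self, action):
        # An action is a tool call: {"tool": name, "args": {...}}.
        self.steps += 1
        tool = action.get("tool")
        args = action.get("args", {})
        files = []
        terminated = False
        if tool == "search":
            # Keep records whose fields match every requested key.
            files = [
                r.path for r in self.records
                if all(getattr(r, key, None) == value for key, value in args.items())
            ]
            message = f"{len(files)} file(s) matched"
        elif tool == "answer":
            terminated = True
            message = "episode finished"
        else:
            message = f"unknown tool: {tool}"
        # Stop the episode once the step budget is used up.
        truncated = not terminated and self.steps >= self.max_steps
        observation = {"message": message, "files": files}
        # Reward is a placeholder; answer scoring is omitted from this sketch.
        return observation, 0.0, terminated, truncated, {"steps": self.steps}


records = [
    FileRecord("t1_2020_opt.tif", "T1", "2020-01-01", "optical"),
    FileRecord("t1_2021_sar.tif", "T1", "2021-01-01", "sar"),
    FileRecord("t2_2021_opt.tif", "T2", "2021-01-01", "optical"),
]
env = ToyEOWorkspace(records)
obs, info = env.reset()
# Switch sensor for the same location: ask for SAR files on tile T1.
obs, reward, terminated, truncated, info = env.step(
    {"tool": "search", "args": {"tile": "T1", "sensor": "sar"}}
)
print(obs["files"])
```

The point of the design is that the agent never receives a fixed input: it has to decide which location, date, or sensor to query next, and each tool call returns a new observation it can reason over.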
