Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations

Junjue Wang; Weihao Xuan; Heli Qi; Pengyu Dai; Kunyi Liu; Hongruixuan Chen; Zhuo Zheng; Junshi Xia; Stefano Ermon; Naoto Yokoya

arXiv:2605.11633·cs.AI·May 13, 2026

TL;DR

This paper introduces DORA, a comprehensive benchmark for evaluating large language models' ability to perform end-to-end disaster response tasks using heterogeneous geospatial data, revealing key challenges in grounding, tool use, and compositional reasoning.

Contribution

DORA is the first extensive benchmark with real-world disaster tasks, expert-verified trajectories, and diverse geospatial data, enabling systematic evaluation of LLMs in emergency operations.

Findings

01. An evaluation of 13 LLMs reveals challenges in grounding and tool selection.

02. Gold tool-order hints improve accuracy by only 1-4%.

03. Longer response pipelines significantly widen performance gaps.

Abstract

Operational disaster response goes beyond damage assessment, requiring responders to integrate multi-sensor signals, reason over road networks, populations, and key facilities, plan evacuations, and produce actionable reports. However, prior work largely isolates remote-sensing perception or evaluates generic tool use, leaving the end-to-end workflows of emergency operations underexplored. In this paper, we introduce the Disaster Operational Response Agent benchmark (DORA), the first agentic benchmark for end-to-end disaster response: 515 expert-authored tasks across 45 real-world disaster events spanning 10 types, paired with expert-verified, replayable gold trajectories totaling 3,500 tool-call steps. Tasks span five dimensions that cover the operational disaster-response pipeline: disaster perception, spatial relational analysis, rescue and evacuation planning, temporal evolution…
