Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations
Junjue Wang, Weihao Xuan, Heli Qi, Pengyu Dai, Kunyi Liu, Hongruixuan Chen, Zhuo Zheng, Junshi Xia, Stefano Ermon, Naoto Yokoya

TL;DR
This paper introduces DORA, a comprehensive benchmark for evaluating large language models' ability to perform end-to-end disaster response tasks using heterogeneous geospatial data, revealing key challenges in grounding, tool use, and compositional reasoning.
Contribution
DORA is the first extensive benchmark with real-world disaster tasks, expert-verified trajectories, and diverse geospatial data, enabling systematic evaluation of LLMs in emergency operations.
Findings
Evaluating 13 LLMs reveals challenges in grounding and tool selection.
Gold tool-order hints improve accuracy by only 1-4%.
Longer response pipelines significantly increase performance gaps.
Abstract
Operational disaster response goes beyond damage assessment, requiring responders to integrate multi-sensor signals, reason over road networks, populations, and key facilities, plan evacuations, and produce actionable reports. However, prior work largely isolates remote-sensing perception or evaluates generic tool use, leaving the end-to-end workflows of emergency operations underexplored. In this paper, we introduce the Disaster Operational Response Agent benchmark (DORA), the first agentic benchmark for end-to-end disaster response: 515 expert-authored tasks across 45 real-world disaster events spanning 10 types, paired with expert-verified, replayable gold trajectories totaling 3,500 tool-call steps. Tasks span five dimensions that cover the operational disaster-response pipeline: disaster perception, spatial relational analysis, rescue and evacuation planning, temporal evolution…
