TL;DR
ESARBench introduces a comprehensive, realistic benchmark for evaluating multimodal large language model-driven UAV agents in complex search and rescue scenarios, addressing a key gap in the field.
Contribution
The paper proposes the novel ESAR task and presents the first high-fidelity, GIS-mapped benchmark for UAV search and rescue, including diverse environments and evaluation metrics.
Findings
Experimental results reveal challenges in spatial memory and aerial adaptation.
Trade-offs are identified between search efficiency and flight safety.
Baseline evaluations highlight bottlenecks in current UAV SAR methods.
Abstract
The rapid advancement of Multimodal Large Language Models (MLLMs) has empowered Unmanned Aerial Vehicles (UAVs) with exceptional capabilities in spatial reasoning, semantic understanding, and complex decision-making, making them inherently suited for UAV Search and Rescue (SAR). However, existing UAV SAR research is dominated by traditional vision and path-planning methods and lacks a comprehensive, unified benchmark for embodied agents. To bridge this gap, we first propose the novel task of Embodied Search and Rescue (ESAR), which requires aerial agents to autonomously explore complex environments, identify rescue clues, and reason about victim locations to make informed decisions. Additionally, we present ESARBench, the first comprehensive benchmark designed to evaluate MLLM-driven UAV agents in highly realistic SAR scenarios. Leveraging Unreal Engine 5 and…
