AerialMind: Towards Referring Multi-Object Tracking in UAV Scenarios

Chenglizhao Chen; Shaofeng Liang; Runwei Guan; Xiaolou Sun; Haocheng Zhao; Haiyun Jiang; Tao Huang; Henghui Ding; Qing-Long Han

arXiv:2511.21053 · cs.RO · December 2, 2025

Open Access · 1 Video

TL;DR

AerialMind introduces a large-scale benchmark and a novel method for referring multi-object tracking in UAV scenarios, leveraging aerial perspectives and natural language instructions for improved scene understanding.

Contribution

The paper presents the first UAV-specific RMOT benchmark, a semi-automated annotation framework, and a new collaborative vision-language learning method called HawkEyeTrack.

Findings

01. The dataset is challenging and diverse.
02. HawkEyeTrack improves tracking accuracy.
03. The annotation framework reduces labeling costs.

Abstract

Referring Multi-Object Tracking (RMOT) aims to achieve precise object detection and tracking through natural language instructions, representing a fundamental capability for intelligent robotic systems. However, current RMOT research remains mostly confined to ground-level scenarios, which constrains its ability to capture broad-scale scene context and perform comprehensive tracking and path planning. In contrast, Unmanned Aerial Vehicles (UAVs) leverage their expansive aerial perspectives and superior maneuverability to enable wide-area surveillance. Moreover, UAVs have emerged as critical platforms for Embodied Intelligence, which has given rise to an unprecedented demand for intelligent aerial systems capable of natural language interaction. To this end, we introduce AerialMind, the first large-scale RMOT benchmark in UAV scenarios, which aims to bridge this research gap. To…

Videos

AerialMind: Towards Referring Multi-Object Tracking in UAV Scenarios · Underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Surveillance and Tracking Methods · UAV Applications and Optimization