AerialMind: Towards Referring Multi-Object Tracking in UAV Scenarios
Chenglizhao Chen, Shaofeng Liang, Runwei Guan, Xiaolou Sun, Haocheng Zhao, Haiyun Jiang, Tao Huang, Henghui Ding, Qing-Long Han

TL;DR
AerialMind introduces a large-scale benchmark and a novel method for referring multi-object tracking in UAV scenarios, leveraging aerial perspectives and natural language instructions for improved scene understanding.
Contribution
The paper presents the first UAV-specific RMOT benchmark, a semi-automated annotation framework, and a new collaborative vision-language learning method called HawkEyeTrack.
Findings
The dataset is challenging and diverse.
HawkEyeTrack improves tracking accuracy.
The annotation framework reduces labeling costs.
Abstract
Referring Multi-Object Tracking (RMOT) aims to achieve precise object detection and tracking through natural language instructions, representing a fundamental capability for intelligent robotic systems. However, current RMOT research remains mostly confined to ground-level scenarios, which constrains their ability to capture broad-scale scene contexts and perform comprehensive tracking and path planning. In contrast, Unmanned Aerial Vehicles (UAVs) leverage their expansive aerial perspectives and superior maneuverability to enable wide-area surveillance. Moreover, UAVs have emerged as critical platforms for Embodied Intelligence, which has given rise to an unprecedented demand for intelligent aerial systems capable of natural language interaction. To this end, we introduce AerialMind, the first large-scale RMOT benchmark in UAV scenarios, which aims to bridge this research gap. To…
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Surveillance and Tracking Methods · UAV Applications and Optimization
