Text-based Aerial-Ground Person Retrieval
Xinyu Zhou, Yu Wu, Jiayao Ma, Wenhao Wang, Min Cao, Mang Ye

TL;DR
This paper introduces a new task of retrieving person images from aerial and ground views using text descriptions, supported by a new dataset and a novel retrieval framework that handles large viewpoint differences.
Contribution
The paper presents the TAG-PEDES dataset, built with a diversified text generation paradigm, and the TAG-CLIP framework, which handles view heterogeneity through a hierarchically-routed mixture of experts module and a viewpoint decoupling strategy.
Findings
TAG-CLIP outperforms existing methods on TAG-PEDES and T-PR benchmarks.
The dataset enables robust training for cross-view text-based person retrieval.
Viewpoint decoupling improves cross-modal alignment in heterogeneous views.
Abstract
This work introduces Text-based Aerial-Ground Person Retrieval (TAG-PR), which aims to retrieve person images from heterogeneous aerial and ground views with textual descriptions. Unlike traditional Text-based Person Retrieval (T-PR), which focuses solely on ground-view images, TAG-PR introduces greater practical significance and presents unique challenges due to the large viewpoint discrepancy across images. To support this task, we contribute: (1) TAG-PEDES dataset, constructed from public benchmarks with automatically generated textual descriptions, enhanced by a diversified text generation paradigm to ensure robustness under view heterogeneity; and (2) TAG-CLIP, a novel retrieval framework that addresses view heterogeneity through a hierarchically-routed mixture of experts module to learn view-specific and view-agnostic features and a viewpoint decoupling strategy to decouple…
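To make the retrieval setting concrete, the following is a minimal sketch of text-to-image ranking over a mixed aerial and ground gallery. It assumes text and images are already embedded in a shared space and compared by cosine similarity, as in CLIP-style models; the abstract does not confirm this is how TAG-CLIP scores matches. The function names, the toy embeddings, and the Rank-k helper are hypothetical and only illustrate the task, not the authors' method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_gallery(text_emb, gallery):
    """Rank gallery images by similarity to a text embedding.

    gallery: list of (image_id, view, embedding), where view is "aerial" or "ground".
    Returns a list of (image_id, view, score), best match first.
    """
    scored = [(cosine(text_emb, emb), image_id, view) for image_id, view, emb in gallery]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(image_id, view, score) for score, image_id, view in scored]

def rank_k_hit(ranking, relevant_ids, k):
    """True if any relevant image appears among the top-k results."""
    return any(image_id in relevant_ids for image_id, _, _ in ranking[:k])

# Toy data (assumed, not from the paper): one query embedding, three gallery images.
gallery = [
    ("g1", "ground", [0.9, 0.1, 0.0]),
    ("a1", "aerial", [0.8, 0.2, 0.1]),
    ("g2", "ground", [0.0, 1.0, 0.0]),
]
query = [1.0, 0.0, 0.0]
ranking = rank_gallery(query, gallery)
```

In this toy setup the aerial image `a1` ranks below the ground image `g1` even if both show the same person, which mirrors the viewpoint gap that the abstract identifies as the central difficulty of the task.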
Taxonomy
Topics: Video Surveillance and Tracking Methods · UAV Applications and Optimization · Advanced Neural Network Applications
