FALCON: Future-Aware Learning with Contextual Object-Centric Pretraining for UAV Action Recognition
Ruiqi Xian, Xiyang Wu, Tianrui Guan, Xijun Wang, Boqing Gong, Dinesh Manocha

TL;DR
FALCON is a self-supervised pretraining method for UAV action recognition that focuses on object-centric regions and future content reconstruction, improving accuracy and efficiency on aerial video benchmarks.
Contribution
It introduces an object-aware masked autoencoding approach with dual-horizon future reconstruction for UAV videos, emphasizing action-relevant regions and temporal dynamics.
Findings
Improves top-1 accuracy by 2.9% on NEC-Drone
Achieves 5.8% improvement on UAV-Human
Offers 2-5x faster inference than supervised methods
Abstract
We introduce FALCON, a unified self-supervised video pretraining approach for UAV action recognition from raw RGB aerial footage, requiring no additional preprocessing at inference. UAV videos exhibit severe spatial imbalance: large, cluttered backgrounds dominate the field of view, causing reconstruction-based pretraining to waste capacity on uninformative regions and under-learn action-relevant human/object cues. FALCON addresses this by integrating object-aware masked autoencoding with object-centric dual-horizon future reconstruction. Using detections only during pretraining, we construct objectness priors that (i) enforce balanced token visibility during masking and (ii) concentrate reconstruction supervision on action-relevant regions, preventing learning from being dominated by background appearance. To promote temporal dynamics learning, we further reconstruct short- and…
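The two uses of the objectness prior named in the abstract, balanced token visibility and object-focused reconstruction supervision, can be illustrated with a small sketch. This is not the authors' implementation. The box format, the function names, the `object_share` and `bg_weight` parameters, and the use of a single scalar per token are all assumptions made for illustration. The future-reconstruction part is left out because the abstract is truncated before it is described.

```python
import random


def objectness_prior(grid_h, grid_w, boxes):
    """Return a flat 0/1 list with 1 where a patch falls inside any detection box.

    Hypothetical box format: (row0, col0, row1, col1) in patch units, end-exclusive.
    """
    prior = [0] * (grid_h * grid_w)
    for r0, c0, r1, c1 in boxes:
        for r in range(max(r0, 0), min(r1, grid_h)):
            for c in range(max(c0, 0), min(c1, grid_w)):
                prior[r * grid_w + c] = 1
    return prior


def balanced_visible_tokens(prior, num_visible, object_share=0.5, rng=None):
    """Pick visible (unmasked) token indices with a fixed share from object patches.

    object_share is an assumed hyperparameter, not a value from the paper.
    """
    rng = rng or random.Random(0)
    obj = [i for i, p in enumerate(prior) if p]
    bg = [i for i, p in enumerate(prior) if not p]
    n_obj = min(len(obj), round(num_visible * object_share))
    n_bg = min(len(bg), num_visible - n_obj)
    # If there are too few background patches, top up from object patches.
    n_obj = min(len(obj), num_visible - n_bg)
    return sorted(rng.sample(obj, n_obj) + rng.sample(bg, n_bg))


def object_weighted_loss(pred, target, prior, visible, bg_weight=0.1):
    """Weighted mean squared error over masked tokens only.

    Object patches get weight 1.0 and background patches get bg_weight (assumed),
    so background appearance contributes little to the loss.
    """
    visible_set = set(visible)
    num = den = 0.0
    for i, (p, t) in enumerate(zip(pred, target)):
        if i in visible_set:
            continue  # visible tokens are encoder inputs, not reconstruction targets
        w = 1.0 if prior[i] else bg_weight
        num += w * (p - t) ** 2
        den += w
    return num / den if den else 0.0


if __name__ == "__main__":
    prior = objectness_prior(4, 4, [(1, 1, 3, 3)])  # one 2x2 detection in a 4x4 grid
    visible = balanced_visible_tokens(prior, num_visible=4)
    loss = object_weighted_loss([0.0] * 16, [1.0] * 16, prior, visible)
    print(visible, loss)
```

With uniformly random masking, a small detection covering 4 of 16 patches would contribute on average only one of four visible tokens. Fixing the object share keeps the encoder from seeing mostly background, and the weighted loss does the same for the reconstruction target.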
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Anomaly Detection Techniques and Applications
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
