FALCON: Future-Aware Learning with Contextual Object-Centric Pretraining for UAV Action Recognition

Ruiqi Xian; Xiyang Wu; Tianrui Guan; Xijun Wang; Boqing Gong; Dinesh Manocha

arXiv:2409.18300 · cs.CV · March 9, 2026

TL;DR

FALCON is a self-supervised pretraining method for UAV action recognition that focuses on object-centric regions and future content reconstruction, improving accuracy and efficiency on aerial video benchmarks.

Contribution

FALCON introduces an object-aware masked autoencoding approach with dual-horizon future reconstruction for UAV videos, emphasizing action-relevant regions and temporal dynamics.

Findings

01 Improves top-1 accuracy by 2.9% on NEC-Drone

02 Achieves 5.8% improvement on UAV-Human

03 Offers 2-5x faster inference than supervised methods

Abstract

We introduce FALCON, a unified self-supervised video pretraining approach for UAV action recognition from raw RGB aerial footage, requiring no additional preprocessing at inference. UAV videos exhibit severe spatial imbalance: large, cluttered backgrounds dominate the field of view, causing reconstruction-based pretraining to waste capacity on uninformative regions and under-learn action-relevant human/object cues. FALCON addresses this by integrating object-aware masked autoencoding with object-centric dual-horizon future reconstruction. Using detections only during pretraining, we construct objectness priors that (i) enforce balanced token visibility during masking and (ii) concentrate reconstruction supervision on action-relevant regions, preventing learning from being dominated by background appearance. To promote temporal dynamics learning, we further reconstruct short- and…
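The two uses of the objectness prior named in the abstract, balanced token visibility and object-focused reconstruction supervision, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `object_share` and `bg_weight` parameters, and the choice of a binary per-token prior are all assumptions made for illustration, and the future-reconstruction objective is left out because the abstract is truncated before describing it.

```python
import numpy as np

def balanced_visible_mask(objectness, num_visible, object_share=0.5, rng=None):
    """Choose visible tokens so object tokens get ~object_share of the budget.

    objectness: 1-D bool array, True where a token overlaps a detection
    (hypothetical binary prior). Returns a bool array, True = visible.
    """
    rng = np.random.default_rng() if rng is None else rng
    objectness = np.asarray(objectness, dtype=bool)
    obj_idx = np.flatnonzero(objectness)
    bg_idx = np.flatnonzero(~objectness)
    # Object quota, clamped to the number of object tokens available.
    n_obj = min(len(obj_idx), int(round(object_share * num_visible)))
    # Background fills the rest; if it runs short, objects take the slack.
    n_bg = min(len(bg_idx), num_visible - n_obj)
    n_obj = min(len(obj_idx), num_visible - n_bg)
    visible = np.zeros(objectness.shape, dtype=bool)
    if n_obj:
        visible[rng.choice(obj_idx, size=n_obj, replace=False)] = True
    if n_bg:
        visible[rng.choice(bg_idx, size=n_bg, replace=False)] = True
    return visible

def object_weighted_loss(pred, target, objectness, visible, bg_weight=0.1):
    """Mean squared error over masked tokens, background down-weighted.

    pred, target: (num_tokens, dim) arrays. bg_weight is an assumed
    hyperparameter, not a value taken from the paper.
    """
    objectness = np.asarray(objectness, dtype=bool)
    visible = np.asarray(visible, dtype=bool)
    per_token = ((pred - target) ** 2).mean(axis=-1)
    # Only masked (non-visible) tokens contribute to the loss.
    weights = np.where(objectness, 1.0, bg_weight) * (~visible)
    return float((per_token * weights).sum() / max(weights.sum(), 1e-8))
```

With uniform random masking, a frame in which only a few percent of tokens cover people would usually expose almost none of them to the encoder. Reserving part of the visible budget for object tokens, and weighting the loss toward masked object tokens, is one simple way to counter that imbalance.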

Taxonomy

Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Anomaly Detection Techniques and Applications

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings