TL;DR
TrackMAE introduces a novel masked video modeling approach that explicitly incorporates motion information through point tracking and motion-aware masking, leading to improved video representations for motion-centric tasks.
Contribution
TrackMAE uses point trajectories from an off-the-shelf tracker as an explicit reconstruction signal, and uses the same trajectories for motion-aware masking, so that masked video modeling encodes temporal dynamics more directly.
Findings
Outperforms state-of-the-art video self-supervised learning methods on six datasets.
Learns more discriminative and generalizable video representations.
Improves performance on motion-centric downstream tasks.
Abstract
Masked video modeling (MVM) has emerged as a simple and scalable self-supervised pretraining paradigm, but only encodes motion information implicitly, limiting the encoding of temporal dynamics in the learned representations. As a result, such models struggle on motion-centric tasks that require fine-grained motion awareness. To address this, we propose TrackMAE, a simple masked video modeling paradigm that explicitly uses motion information as a reconstruction signal. In TrackMAE, we use an off-the-shelf point tracker to sparsely track points in the input videos, generating motion trajectories. Furthermore, we exploit the extracted trajectories to improve random tube masking with a motion-aware masking strategy. We enhance video representations learned in both pixel and feature semantic reconstruction spaces by providing a complementary supervision signal in the form of motion targets.…
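The abstract does not say how the trajectories drive the masking, so the following is only a minimal sketch of one plausible scheme, not the authors' method. It assumes tube masking (one spatial mask shared by all frames) and assumes that patches with more tracked motion are sampled for masking with higher probability. The function name `motion_aware_tube_mask` and the parameters `patch`, `mask_ratio`, and `motion_weight` are hypothetical, as is the `(N, T, 2)` track layout.

```python
import numpy as np


def motion_aware_tube_mask(tracks, frame_hw, patch=16, mask_ratio=0.9,
                           motion_weight=1.0, rng=None):
    """Sample a spatial tube mask biased toward high-motion patches.

    tracks: float array of shape (N, T, 2) holding (x, y) pixel positions
            of N tracked points over T frames (assumed layout).
    Returns a boolean (H // patch, W // patch) grid; True means "masked".
    The same grid is meant to be applied to every frame (tube masking).
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width = frame_hw
    grid_h, grid_w = height // patch, width // patch
    n_patches = grid_h * grid_w

    # Total path length of each trajectory, used as a motion magnitude.
    step = np.diff(tracks, axis=1)                      # (N, T-1, 2)
    path_len = np.linalg.norm(step, axis=-1).sum(axis=1)  # (N,)

    # Assign each point to the patch containing its first-frame position.
    col = np.clip(tracks[:, 0, 0] // patch, 0, grid_w - 1).astype(int)
    row = np.clip(tracks[:, 0, 1] // patch, 0, grid_h - 1).astype(int)
    patch_idx = row * grid_w + col

    # Mean motion per patch; patches without tracked points get zero.
    motion = np.zeros(n_patches)
    np.add.at(motion, patch_idx, path_len)
    counts = np.bincount(patch_idx, minlength=n_patches)
    motion = motion / np.maximum(counts, 1)

    # Uniform floor plus normalized motion, so every patch stays selectable
    # and motion_weight=0 falls back to plain random tube masking.
    weights = 1.0 + motion_weight * motion / (motion.max() + 1e-8)
    probs = weights / weights.sum()

    n_mask = int(round(mask_ratio * n_patches))
    chosen = rng.choice(n_patches, size=n_mask, replace=False, p=probs)
    mask = np.zeros(n_patches, dtype=bool)
    mask[chosen] = True
    return mask.reshape(grid_h, grid_w)
```

Because of the uniform floor, static regions can still be masked; only the sampling odds shift toward moving regions. Whether TrackMAE biases masking toward or away from motion, and by how much, would need to be checked against the full paper.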
