TrackMAE: Video Representation Learning via Track Mask and Predict

Renaud Vandeghen, Fida Mohammad Thoker, Marc Van Droogenbroeck, and Bernard Ghanem

arXiv:2603.27268 · cs.CV · March 31, 2026

TL;DR

TrackMAE introduces a novel masked video modeling approach that explicitly incorporates motion information through point tracking and motion-aware masking, leading to improved video representations for motion-centric tasks.

Contribution

TrackMAE adds point trajectories as an explicit reconstruction target and a motion-aware masking strategy to masked video modeling, so that the learned representations encode temporal dynamics more directly.
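The page does not describe how the motion-aware masking is computed. The Python sketch below illustrates one plausible reading under stated assumptions: each track is a list of (x, y) positions, a patch's motion score is the total path length of the tracks that start inside it, and tubes (the same spatial position masked in every frame) are drawn with probability proportional to that score instead of uniformly. The functions `patch_motion_scores`, `motion_aware_tube_mask`, and `tube_to_token_indices` and the `eps` smoothing term are hypothetical illustrations, not the code in rvandeghen/TrackMAE.

```python
import math
import random
from typing import Iterable, List, Sequence, Set, Tuple

# Assumption: a track is the (x, y) position of one tracked point in each frame.
Track = Sequence[Tuple[float, float]]


def patch_motion_scores(tracks: Iterable[Track], grid_h: int, grid_w: int, patch: int) -> List[float]:
    """Hypothetical scoring rule: add each track's total path length to the
    spatial patch that contains its first point."""
    scores = [0.0] * (grid_h * grid_w)
    for track in tracks:
        x0, y0 = track[0]
        row = max(0, min(int(y0 // patch), grid_h - 1))
        col = max(0, min(int(x0 // patch), grid_w - 1))
        path_length = sum(math.dist(a, b) for a, b in zip(track, track[1:]))
        scores[row * grid_w + col] += path_length
    return scores


def motion_aware_tube_mask(scores: Sequence[float], mask_ratio: float,
                           eps: float = 1e-3, seed: int = 0) -> Set[int]:
    """Choose spatial positions to mask, sampling without replacement with
    probability proportional to (motion score + eps). Hypothetical weighting."""
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    n_mask = round(mask_ratio * len(scores))
    candidates = list(range(len(scores)))
    weights = [s + eps for s in scores]
    chosen: Set[int] = set()
    for _ in range(n_mask):
        # Draw one remaining candidate, then remove it so it cannot be drawn twice.
        pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        chosen.add(candidates.pop(pick))
        weights.pop(pick)
    return chosen


def tube_to_token_indices(spatial: Iterable[int], n_time: int, n_spatial: int) -> List[int]:
    """Tube masking: the same spatial positions are masked in every temporal slice."""
    spatial = list(spatial)  # iterate once per time step, so materialize first
    return sorted(t * n_spatial + s for t in range(n_time) for s in spatial)
```

A small positive `eps` keeps static patches selectable; when the tracks carry little motion, the weights become nearly uniform and the sampler falls back toward ordinary random tube masking.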

Findings

1. Outperforms state-of-the-art video self-supervised learning methods on six datasets.
2. Learns more discriminative and generalizable video representations.
3. Improves performance on motion-centric downstream tasks.

Abstract

Masked video modeling (MVM) has emerged as a simple and scalable self-supervised pretraining paradigm, but it encodes motion information only implicitly, which limits how well temporal dynamics are captured in the learned representations. As a result, such models struggle on motion-centric tasks that require fine-grained motion awareness. To address this, we propose TrackMAE, a simple masked video modeling paradigm that explicitly uses motion information as a reconstruction signal. In TrackMAE, we use an off-the-shelf point tracker to sparsely track points in the input videos, generating motion trajectories. Furthermore, we exploit the extracted trajectories to improve random tube masking with a motion-aware masking strategy. We enhance video representations learned in both pixel and feature semantic reconstruction spaces by providing a complementary supervision signal in the form of motion targets.…
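To illustrate what "motion targets as a complementary supervision signal" could look like in practice, the sketch below adds a trajectory-regression term to a standard MAE-style masked reconstruction loss. The following are assumptions, not details from the paper: the motion target for a token is its track's displacement from the first frame, flattened as [dx1, dy1, dx2, dy2, ...]; both terms are mean squared errors computed over masked tokens only; and `motion_weight` is a hypothetical balancing coefficient. The appearance term stands in for either the pixel or the feature reconstruction target mentioned in the abstract.

```python
from typing import Iterable, List, Sequence, Tuple

# Assumption: a track is the (x, y) position of one tracked point per frame.
Track = Sequence[Tuple[float, float]]
Tokens = Sequence[Sequence[float]]  # one target/prediction vector per token


def track_to_motion_target(track: Track) -> List[float]:
    """Hypothetical motion target: displacement of each later frame from the first."""
    x0, y0 = track[0]
    target: List[float] = []
    for x, y in track[1:]:
        target.extend([float(x - x0), float(y - y0)])
    return target


def masked_mse(pred: Tokens, target: Tokens, masked: Iterable[int]) -> float:
    """Mean squared error over the entries of masked tokens only (visible tokens are ignored)."""
    total, count = 0.0, 0
    for tok in masked:
        for p, t in zip(pred[tok], target[tok]):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("no masked entries to average over")
    return total / count


def combined_loss(pred_appearance: Tokens, target_appearance: Tokens,
                  pred_motion: Tokens, target_motion: Tokens,
                  masked: Iterable[int], motion_weight: float = 1.0) -> float:
    """Appearance reconstruction (pixels or features) plus weighted trajectory prediction."""
    masked = list(masked)  # reuse the same masked set for both terms
    appearance = masked_mse(pred_appearance, target_appearance, masked)
    motion = masked_mse(pred_motion, target_motion, masked)
    return appearance + motion_weight * motion
```

Restricting both terms to masked tokens mirrors the usual MAE design, where the model is scored only on what it could not see.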

Code & Models

Repositories

rvandeghen/TrackMAE (GitHub)