Reinforcement Learning meets Masked Video Modeling : Trajectory-Guided Adaptive Token Selection

Ayush K. Rai; Kyle Min; Tarun Krishna; Feiyan Hu; Alan F. Smeaton; Noel E. O'Connor

arXiv:2505.08561 · cs.CV · August 15, 2025

TL;DR

This paper introduces TATS, a trajectory-aware token sampler for masked video modeling, which improves action recognition performance by adaptively selecting motion-centric tokens during pre-training.

Contribution

The work proposes a novel TATS method that models token motion dynamics and integrates with MAE, enabling efficient, adaptive, and motion-focused token selection in video pre-training.

Findings

01. TATS enables aggressive masking without performance loss.

02. The approach improves transferability and generalization across benchmarks.

03. It achieves state-of-the-art results on multiple video action recognition datasets.

Abstract

Masked video modeling (MVM) has emerged as a highly effective pre-training strategy for visual foundation models, whereby the model reconstructs masked spatiotemporal tokens using information from visible tokens. However, a key challenge in such approaches lies in selecting an appropriate masking strategy. Previous studies have explored predefined masking techniques, including random and tube-based masking, as well as approaches that leverage key motion priors, optical flow and semantic cues from externally pre-trained models. In this work, we introduce a novel and generalizable Trajectory-Aware Adaptive Token Sampler (TATS), which models the motion dynamics of tokens and can be seamlessly integrated into the masked autoencoder (MAE) framework to select motion-centric tokens in videos. Additionally, we propose a unified training strategy that enables joint optimization of both MAE and…
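
The two predefined baselines named in the abstract, random and tube-based masking, can be illustrated with a small sketch. This is not TATS and not the authors' code: it only shows how the baseline masks differ, assuming a video already split into a T x H x W grid of tokens. The function names `random_mask` and `tube_mask`, the `mask_ratio` parameter, and the grid sizes are hypothetical choices made for illustration.

```python
import random


def random_mask(t, h, w, mask_ratio, seed=0):
    """Mask a fraction of all T*H*W tokens, chosen uniformly at random.

    Hypothetical helper for illustration; returns mask[f][i], which is True
    when token i (row-major over the H*W grid) of frame f is masked.
    """
    rng = random.Random(seed)
    n_tokens = t * h * w
    n_masked = int(round(mask_ratio * n_tokens))
    masked = set(rng.sample(range(n_tokens), n_masked))
    return [
        [(f * h * w + i) in masked for i in range(h * w)]
        for f in range(t)
    ]


def tube_mask(t, h, w, mask_ratio, seed=0):
    """Mask the same spatial positions in every frame (a tube through time).

    Hypothetical helper for illustration; the spatial pattern is sampled once
    and then repeated for all T frames.
    """
    rng = random.Random(seed)
    n_spatial = h * w
    n_masked = int(round(mask_ratio * n_spatial))
    masked = set(rng.sample(range(n_spatial), n_masked))
    frame = [i in masked for i in range(n_spatial)]
    return [list(frame) for _ in range(t)]


if __name__ == "__main__":
    # Assumed grid: 8 temporal slots of 14 x 14 tokens, 90% masking.
    rm = random_mask(t=8, h=14, w=14, mask_ratio=0.9)
    tm = tube_mask(t=8, h=14, w=14, mask_ratio=0.9)
    print("random mask, masked tokens:", sum(map(sum, rm)))
    print("tube mask, masked tokens:", sum(map(sum, tm)))
```

Both baselines choose masked positions without looking at the video content. An adaptive sampler such as the one the abstract describes would instead choose tokens based on motion, which this sketch does not attempt to model.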

Taxonomy

Methods: Masked autoencoder