Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation
Tzu Ling Liu, Ian Stavness, Mrigank Rochan

TL;DR
This paper introduces LMFT, a learnable tokenization method that focuses on motion-rich regions in videos to improve unsupervised domain adaptation for action recognition, achieving state-of-the-art results efficiently.
Contribution
The paper proposes a learnable motion-focused tokenization approach that discards low-motion background tokens and retains motion-rich, action-relevant ones, improving both VUDA performance and computational efficiency.
Findings
Achieves state-of-the-art performance on three standard VUDA benchmarks.
Reduces computational overhead significantly.
Effectively discards background tokens, focusing on motion-rich regions.
Abstract
Video Unsupervised Domain Adaptation (VUDA) poses a significant challenge in action recognition, requiring the adaptation of a model from a labeled source domain to an unlabeled target domain. Despite recent advances, existing VUDA methods often fall short of fully supervised performance, a key reason being the prevalence of static and uninformative backgrounds that exacerbate domain shifts. Additionally, prior approaches largely overlook computational efficiency, limiting real-world adoption. To address these issues, we propose Learnable Motion-Focused Tokenization (LMFT) for VUDA. LMFT tokenizes video frames into patch tokens and learns to discard low-motion, redundant tokens, primarily corresponding to background regions, while retaining motion-rich, action-relevant tokens for adaptation. Extensive experiments on three standard VUDA benchmarks across 21 domain adaptation settings…
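The token-selection idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's method: LMFT learns which tokens to discard, whereas the code below substitutes a fixed heuristic (mean absolute frame difference per patch) purely to show what "keep motion-rich patches, drop static ones" means in practice. The function names, the `keep_ratio` parameter, and the grayscale nested-list frame format are all assumptions made for illustration.

```python
from typing import List

Frame = List[List[float]]  # hypothetical format: grayscale frame, H x W


def patch_motion_scores(prev: Frame, curr: Frame, patch: int) -> List[float]:
    """Mean absolute frame difference per non-overlapping patch (row-major order).

    A hand-crafted stand-in for a learned motion score, used for illustration only.
    """
    h, w = len(curr), len(curr[0])
    scores = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            total = 0.0
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    total += abs(curr[y][x] - prev[y][x])
            scores.append(total / (patch * patch))
    return scores


def select_motion_tokens(scores: List[float], keep_ratio: float) -> List[int]:
    """Indices of the highest-motion patches, returned in ascending index order."""
    k = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Two 4x4 frames; only the top-right 2x2 patch changes between them.
    prev = [[0.0] * 4 for _ in range(4)]
    curr = [[0.0] * 4 for _ in range(4)]
    for y in range(2):
        for x in range(2, 4):
            curr[y][x] = 1.0
    scores = patch_motion_scores(prev, curr, patch=2)
    print(select_motion_tokens(scores, keep_ratio=0.25))
```

In this toy setup the static patches score zero and are dropped, so a downstream model would process one token instead of four. A learned tokenizer would replace the fixed difference score with parameters trained jointly with the adaptation objective.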
