Text-Guided Video Masked Autoencoder
David Fan, Jue Wang, Shuai Liao, Zhikang Zhang, Vimal Bhat, Xinyu Li

TL;DR
This paper introduces a novel text-guided masking algorithm for video masked autoencoders that leverages natural language descriptions to improve video representation learning without relying on explicit visual saliency cues.
Contribution
It proposes a new text-guided masking method and a unified framework combining MAE with masked video-text contrastive learning, enhancing downstream video recognition tasks.
Findings
TGM is competitive with motion-guided masking.
Unified framework improves downstream performance.
TGM achieves top results on multiple datasets.
Abstract
Recent video masked autoencoder (MAE) works have designed improved masking algorithms focused on saliency. These works leverage visual cues such as motion to mask the most salient regions. However, the robustness of such visual cues depends on how often input videos match underlying assumptions. On the other hand, natural language description is an information dense representation of video that implicitly captures saliency without requiring modality-specific assumptions, and has not been explored yet for video MAE. To this end, we introduce a novel text-guided masking algorithm (TGM) that masks the video regions with highest correspondence to paired captions. Without leveraging any explicit visual cues for saliency, our TGM is competitive with state-of-the-art masking algorithms such as motion-guided masking. To further benefit from the semantics of natural language for masked…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenerative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Video Analysis and Summarization
MethodsContrastive Learning · Masked autoencoder
