Text-Guided Video Masked Autoencoder

David Fan; Jue Wang; Shuai Liao; Zhikang Zhang; Vimal Bhat; Xinyu Li

arXiv:2408.00759·cs.CV·August 2, 2024

TL;DR

This paper introduces a novel text-guided masking algorithm for video masked autoencoders that leverages natural language descriptions to improve video representation learning without relying on explicit visual saliency cues.

Contribution

The paper proposes a new text-guided masking method and a unified framework that combines MAE with masked video-text contrastive learning, improving performance on downstream video recognition tasks.
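As a rough illustration of what a joint objective of this kind can look like, the sketch below adds a masked reconstruction term to a symmetric video-text contrastive term. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the mean-squared-error reconstruction target, the InfoNCE form of the contrastive term, the temperature of 0.07, and the `contrastive_weight` parameter are all hypothetical choices made for illustration.

```python
import numpy as np

def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error averaged over masked patches only (mask: True = masked)."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    return float((per_patch * mask).sum() / max(int(mask.sum()), 1))

def _l2_normalize(x, eps=1e-8):
    # Scale each row to unit length; eps guards against zero vectors.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def _cross_entropy_diagonal(logits):
    # Cross-entropy where row i's correct class is column i (the paired sample).
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video and text embeddings (assumed form)."""
    v = _l2_normalize(video_emb)
    t = _l2_normalize(text_emb)
    logits = v @ t.T / temperature
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(logits.T))

def joint_loss(pred, target, mask, video_emb, text_emb, contrastive_weight=1.0):
    # Hypothetical weighting: reconstruction plus a scaled contrastive term.
    return (masked_reconstruction_loss(pred, target, mask)
            + contrastive_weight * video_text_contrastive_loss(video_emb, text_emb))
```

In this sketch the video embedding would come from the encoder applied to the visible (unmasked) patches, which is what makes the contrastive term "masked"; how the paper pools and projects those features is not stated in this summary.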

Findings

01. TGM is competitive with motion-guided masking.

02. Unified framework improves downstream performance.

03. TGM achieves top results on multiple datasets.

Abstract

Recent video masked autoencoder (MAE) works have designed improved masking algorithms focused on saliency. These works leverage visual cues such as motion to mask the most salient regions. However, the robustness of such visual cues depends on how often input videos match underlying assumptions. On the other hand, natural language description is an information-dense representation of video that implicitly captures saliency without requiring modality-specific assumptions, and has not yet been explored for video MAE. To this end, we introduce a novel text-guided masking algorithm (TGM) that masks the video regions with the highest correspondence to paired captions. Without leveraging any explicit visual cues for saliency, our TGM is competitive with state-of-the-art masking algorithms such as motion-guided masking. To further benefit from the semantics of natural language for masked…
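The masking rule in the abstract (mask the regions with the highest correspondence to the caption) can be sketched as a top-k selection over patch-to-text similarity. This is a minimal sketch, not the authors' procedure: it assumes patch and caption embeddings already live in a shared space (for example, from a pretrained vision-language model), uses cosine similarity as the correspondence score, and masks individual patches rather than any particular spatial or temporal shape. The function name `text_guided_mask` and the default `mask_ratio` of 0.75 are hypothetical.

```python
import numpy as np

def text_guided_mask(patch_emb, text_emb, mask_ratio=0.75, eps=1e-8):
    """Mask the patches most similar to the caption.

    patch_emb: (num_patches, dim) array of patch embeddings (assumed precomputed).
    text_emb:  (dim,) caption embedding in the same space (assumed precomputed).
    Returns (mask, similarity), where mask[i] is True if patch i is masked.
    """
    # Cosine similarity between every patch and the caption.
    patches = patch_emb / (np.linalg.norm(patch_emb, axis=1, keepdims=True) + eps)
    text = text_emb / (np.linalg.norm(text_emb) + eps)
    similarity = patches @ text

    # Select the top-k most caption-relevant patches for masking.
    num_masked = int(round(mask_ratio * similarity.shape[0]))
    top_indices = np.argsort(-similarity)[:num_masked]

    mask = np.zeros(similarity.shape[0], dtype=bool)
    mask[top_indices] = True
    return mask, similarity
```

With four toy patches and a caption embedding pointing along the first axis, a 50% ratio masks the two patches best aligned with that axis:

```python
patches = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
mask, _ = text_guided_mask(patches, np.array([1.0, 0.0]), mask_ratio=0.5)
# mask is True for patches 0 and 2 only
```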

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Video Analysis and Summarization

Methods: Contrastive Learning · Masked Autoencoder