Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining
Weijun Zhuang, Yuqing Huang, Weikang Meng, Xin Li, Ming Liu, Xiaopeng Hong, Yaowei Wang, Wangmeng Zuo

TL;DR
ClusterSTM introduces a novel cluster-wise spatio-temporal masking strategy for efficient video-language pretraining, effectively capturing holistic and temporally correlated video content while reducing computational costs.
Contribution
The paper proposes ClusterSTM, which uses intra-frame clustering and cluster-wise masking to improve efficiency and semantic understanding in video-language models.
Findings
Achieves state-of-the-art results on multiple benchmarks
Outperforms existing methods in video-text retrieval
Reduces computational costs significantly
Abstract
Large-scale video-language pretraining enables strong generalization across multimodal tasks but often incurs prohibitive computational costs. Although recent advances in masked visual modeling help mitigate this issue, they still suffer from two fundamental limitations: severe visual information loss under high masking ratios and temporal information leakage caused by inter-frame correlations. To address these challenges, we propose ClusterSTM, a Cluster-Wise Spatio-Temporal Masking strategy for efficient video-language pretraining. ClusterSTM first performs intra-frame clustering to partition visual tokens into multiple semantically independent clusters, then conducts cluster-wise masking by retaining the token with the highest temporal density within each cluster. Our masking strategy ensures that the retained tokens capture holistic video content while exhibiting strong temporal…
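The two steps named in the abstract (intra-frame clustering, then keeping one token per cluster) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not define "temporal density", so the score used here (a token's best cosine similarity to tokens in the adjacent frames, averaged over those frames) is an assumption made purely for illustration, as are the plain k-means routine and the function names `kmeans_labels`, `temporal_density`, and `cluster_wise_mask`.

```python
import numpy as np


def kmeans_labels(x, k, iters=10, seed=0):
    """Plain k-means over the rows of x (N, D); returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct tokens (fancy indexing returns a copy).
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every token to every center: (N, k).
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for c in range(k):
            members = x[labels == c]
            if len(members):  # leave the center of an empty cluster unchanged
                centers[c] = members.mean(0)
    return labels


def temporal_density(tokens):
    """ASSUMED score, not taken from the paper.

    tokens: (T, N, D). For each token, take its maximum cosine similarity to
    any token in each adjacent frame and average over the adjacent frames.
    """
    num_frames = tokens.shape[0]
    unit = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + 1e-8)
    score = np.zeros(tokens.shape[:2])
    for t in range(num_frames):
        neighbours = [s for s in (t - 1, t + 1) if 0 <= s < num_frames]
        for s in neighbours:
            sim = unit[t] @ unit[s].T  # (N, N) cosine similarities
            score[t] += sim.max(1)
        score[t] /= max(len(neighbours), 1)
    return score


def cluster_wise_mask(tokens, k):
    """Boolean keep-mask (T, N): one retained token per non-empty cluster per frame."""
    num_frames, num_tokens, _ = tokens.shape
    density = temporal_density(tokens)
    keep = np.zeros((num_frames, num_tokens), dtype=bool)
    for t in range(num_frames):
        labels = kmeans_labels(tokens[t], k, seed=t)
        for c in np.unique(labels):
            idx = np.flatnonzero(labels == c)
            # Retain the token with the highest (assumed) temporal density.
            keep[t, idx[density[t, idx].argmax()]] = True
    return keep


# Toy input: 4 frames, 49 patch tokens per frame, 16-dim features (random data).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 49, 16))
keep = cluster_wise_mask(tokens, k=8)
print("kept per frame:", keep.sum(1), "mask ratio:", 1 - keep.mean())
```

Because at most one token survives per cluster, the number of clusters K sets the masking ratio directly: with N tokens per frame, at least 1 - K/N of them are masked. In this toy setup (K = 8, N = 49) that is at least about 84%.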
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
