Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining

Weijun Zhuang; Yuqing Huang; Weikang Meng; Xin Li; Ming Liu; Xiaopeng Hong; Yaowei Wang; Wangmeng Zuo

arXiv:2603.22953·cs.CV·March 25, 2026

Open Access

TL;DR

ClusterSTM introduces a novel cluster-wise spatio-temporal masking strategy for efficient video-language pretraining, effectively capturing holistic and temporally correlated video content while reducing computational costs.

Contribution

The paper proposes ClusterSTM, which uses intra-frame clustering and cluster-wise masking to improve efficiency and semantic understanding in video-language models.

Findings

01. Achieves state-of-the-art results on multiple benchmarks
02. Outperforms existing methods in video-text retrieval
03. Reduces computational costs significantly

Abstract

Large-scale video-language pretraining enables strong generalization across multimodal tasks but often incurs prohibitive computational costs. Although recent advances in masked visual modeling help mitigate this issue, they still suffer from two fundamental limitations: severe visual information loss under high masking ratios and temporal information leakage caused by inter-frame correlations. To address these challenges, we propose ClusterSTM, a Cluster-Wise Spatio-Temporal Masking strategy for efficient video-language pretraining. ClusterSTM first performs intra-frame clustering to partition visual tokens into multiple semantically independent clusters, then conducts cluster-wise masking by retaining the token with the highest temporal density within each cluster. Our masking strategy ensures that the retained tokens capture holistic video content while exhibiting strong temporal…
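The two steps named in the abstract (intra-frame clustering, then keeping one token per cluster) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not define the clustering algorithm or "temporal density", so the sketch assumes plain k-means for clustering and scores each token by its mean best cosine similarity to tokens in the other frames. The function names `kmeans`, `temporal_density`, and `cluster_wise_mask` are hypothetical.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Plain k-means on the rows of x (assumed stand-in for the paper's clustering)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every token to every center: (N, k)
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for c in range(k):
            members = x[labels == c]
            if len(members) > 0:  # leave the center unchanged if the cluster is empty
                centers[c] = members.mean(0)
    return labels

def temporal_density(tokens):
    """Assumed score, not the paper's definition.

    tokens: (T, N, D) array of T frames with N tokens each.
    Returns a (T, N) array: for each token, the mean over the other frames
    of its highest cosine similarity to any token in that frame.
    """
    T = tokens.shape[0]
    unit = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + 1e-8)
    # sim[t, n, s, m]: similarity of token n in frame t to token m in frame s
    sim = np.einsum("tnd,smd->tnsm", unit, unit)
    best = sim.max(axis=-1)                   # (T, N, T)
    other_frames = ~np.eye(T, dtype=bool)     # exclude a token's own frame
    return (best * other_frames[:, None, :]).sum(-1) / max(T - 1, 1)

def cluster_wise_mask(tokens, num_clusters):
    """Return a boolean (T, N) mask; True marks the one token kept per cluster."""
    T, N, _ = tokens.shape
    density = temporal_density(tokens)
    keep = np.zeros((T, N), dtype=bool)
    for t in range(T):
        labels = kmeans(tokens[t], num_clusters, seed=t)
        for c in np.unique(labels):           # skips clusters that ended up empty
            idx = np.flatnonzero(labels == c)
            keep[t, idx[density[t, idx].argmax()]] = True
    return keep

# Toy input: 4 frames, 16 tokens per frame, 8-dimensional features.
demo_tokens = np.random.default_rng(0).standard_normal((4, 16, 8))
demo_keep = cluster_wise_mask(demo_tokens, num_clusters=4)
```

Under these assumptions each frame retains at most `num_clusters` tokens, so the effective masking ratio is set by the cluster count rather than by random sampling.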

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning