SIGMA: Sinkhorn-Guided Masked Video Modeling
Mohammadreza Salehi, Michael Dorkenwald, Fida Mohammad Thoker, Efstratios Gavves, Cees G. M. Snoek, Yuki M. Asano

TL;DR
SIGMA introduces a novel masked video modeling approach that uses optimal transport-based clustering to learn semantically rich, temporally-aware video representations, outperforming existing methods across multiple datasets.
Contribution
The paper proposes Sinkhorn-guided clustering for masked video modeling, enabling the learning of high-level semantic features and temporal awareness in video representations.
Findings
SIGMA outperforms state-of-the-art methods on ten datasets.
It produces more robust and temporally-aware video representations.
The approach effectively captures high-level semantics through optimal transport clustering.
Abstract
Video-based pretraining offers immense potential for learning strong visual representations on an unprecedented scale. Recently, masked video modeling methods have shown promising scalability, yet fall short in capturing higher-level semantics due to reconstructing predefined low-level targets such as pixels. To tackle this, we present Sinkhorn-guided Masked Video Modelling (SIGMA), a novel video pretraining method that jointly learns the video model in addition to a target feature space using a projection network. However, this simple modification means that the regular L2 reconstruction loss will lead to trivial solutions as both networks are jointly optimized. As a solution, we distribute features of space-time tubes evenly across a limited number of learnable clusters. By posing this as an optimal transport problem, we enforce high entropy in the generated features across the batch,…
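The even distribution of features over clusters described in the abstract can be illustrated with the standard Sinkhorn-Knopp iteration. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `sinkhorn_assign`, the parameters `epsilon` and `n_iters`, and the toy feature and prototype arrays are all hypothetical, and the space-time tube features are replaced by random unit vectors.

```python
import numpy as np


def sinkhorn_assign(scores, epsilon=0.05, n_iters=3):
    """Balanced soft cluster assignments via Sinkhorn-Knopp (illustrative only).

    scores: (N, K) array of similarities between N features and K prototypes.
    Returns an (N, K) array whose rows sum to 1 and whose column sums
    approach N / K as n_iters grows.
    """
    # Subtract the max before exponentiating for numerical stability.
    q = np.exp((scores - scores.max()) / epsilon).T  # shape (K, N)
    q /= q.sum()
    k, n = q.shape
    for _ in range(n_iters):
        # Give every cluster the same total mass (1 / K).
        q /= q.sum(axis=1, keepdims=True)
        q /= k
        # Give every sample the same total mass (1 / N).
        q /= q.sum(axis=0, keepdims=True)
        q /= n
    # Rescale so that each sample's assignment sums to 1.
    return (q * n).T


# Toy usage with hypothetical data: 64 unit-norm features, 4 prototypes.
rng = np.random.default_rng(0)
feats = rng.normal(size=(64, 8))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
protos = rng.normal(size=(4, 8))
protos /= np.linalg.norm(protos, axis=1, keepdims=True)

# A large epsilon keeps this toy problem well conditioned.
q = sinkhorn_assign(feats @ protos.T, epsilon=1.0, n_iters=100)
print(q.sum(axis=0))  # each cluster receives close to 64 / 4 = 16 units of mass
```

Because every cluster is forced to receive the same total mass, no single cluster can absorb all features, which is the collapse that a plain L2 loss between two jointly trained networks would allow.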
Taxonomy
Topics: Kierkegaardian Philosophy and Influence · Computer Graphics and Visualization Techniques
