Self-Supervised Contrastive Learning for Videos using Differentiable Local Alignment
Keyne Oei, Amr Gomaa, Anna Maria Feit, João Belo

TL;DR
This paper introduces a self-supervised contrastive learning method for videos that uses differentiable local alignment to improve frame-wise embeddings, outperforming existing methods in action recognition tasks.
Contribution
It proposes a novel Local-Alignment Contrastive (LAC) loss with differentiable local alignment, enabling dynamic adjustment of temporal gap penalties for better video representation learning.
Findings
Outperforms state-of-the-art on action recognition benchmarks
Uses differentiable Smith-Waterman for flexible local alignment
Enhances discriminative frame-level embeddings
Abstract
Robust frame-wise embeddings are essential to perform video analysis and understanding tasks. We present a self-supervised method for representation learning based on aligning temporal video sequences. Our framework uses a transformer-based encoder to extract frame-level features and leverages them to find the optimal alignment path between video sequences. We introduce the novel Local-Alignment Contrastive (LAC) loss, which combines a differentiable local alignment loss to capture local temporal dependencies with a contrastive loss to enhance discriminative learning. Prior works on video alignment have focused on using global temporal ordering across sequence pairs, whereas our loss encourages identifying the best-scoring subsequence alignment. LAC uses the differentiable Smith-Waterman (SW) affine method, which features a flexible parameterization learned through the training phase,…
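To make the local-alignment idea in the abstract concrete, the following is a minimal sketch of a smoothed Smith-Waterman score with affine gap penalties over two sequences of frame embeddings. It is not the authors' implementation. The function names, the cosine similarity, the log-sum-exp relaxation of the maximum, and the fixed values of `gap_open`, `gap_extend`, and `temperature` are all assumptions made for illustration; the abstract states that the parameterization is learned during training, which would require writing the same recurrences in an automatic-differentiation framework.

```python
import math

NEG_INF = float("-inf")


def soft_max(values, temperature):
    """Smooth maximum: temperature * log(sum(exp(v / temperature)))."""
    finite = [v for v in values if v != NEG_INF]
    if not finite:
        return NEG_INF
    top = max(finite)  # subtract the max for numerical stability
    total = sum(math.exp((v - top) / temperature) for v in finite)
    return top + temperature * math.log(total)


def cosine(u, v):
    """Cosine similarity between two embedding vectors (assumed similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def soft_local_alignment_score(frames_a, frames_b, gap_open=1.0,
                               gap_extend=0.5, temperature=0.1):
    """Smoothed Smith-Waterman score with affine gaps (illustrative only)."""
    n, m = len(frames_a), len(frames_b)
    # match[i][j]: alignments ending with frame i of A aligned to frame j of B
    # skip_a[i][j]: alignments ending with frame i of A left unmatched
    # skip_b[i][j]: alignments ending with frame j of B left unmatched
    match = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    skip_a = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    skip_b = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    cell_scores = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sim = cosine(frames_a[i - 1], frames_b[j - 1])
            # The 0.0 option lets an alignment start fresh at any cell,
            # which is what makes the alignment local rather than global.
            match[i][j] = sim + soft_max(
                [0.0, match[i - 1][j - 1], skip_a[i - 1][j - 1],
                 skip_b[i - 1][j - 1]], temperature)
            # Affine gaps: opening a gap costs more than extending one.
            skip_a[i][j] = soft_max(
                [match[i - 1][j] - gap_open, skip_a[i - 1][j] - gap_extend],
                temperature)
            skip_b[i][j] = soft_max(
                [match[i][j - 1] - gap_open, skip_b[i][j - 1] - gap_extend],
                temperature)
            cell_scores.append(match[i][j])
    # Smooth maximum over all end positions: the best-scoring subsequence pair.
    return soft_max(cell_scores, temperature)


if __name__ == "__main__":
    clip = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    # The same clip surrounded by unrelated frames (hypothetical toy data).
    padded = [[-1.0, 0.0], [-1.0, 0.0]] + clip + [[0.0, -1.0]]
    unrelated = [[-1.0, 0.0], [0.0, -1.0], [-1.0, -1.0]]
    print(soft_local_alignment_score(clip, padded))
    print(soft_local_alignment_score(clip, unrelated))
```

In this toy setting the clip scores much higher against the padded sequence that contains it than against the unrelated sequence, because the zero restart option lets the alignment ignore the surrounding frames. A global alignment would instead be penalized for every unmatched frame at the start and end.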
Taxonomy
Topics: Face and Expression Recognition · Speech and Audio Processing · Machine Learning and ELM
