Self-Supervised Contrastive Learning for Videos using Differentiable Local Alignment

Keyne Oei; Amr Gomaa; Anna Maria Feit; João Belo

arXiv:2409.04607·cs.CV·March 4, 2025

TL;DR

This paper introduces a self-supervised contrastive learning method for videos that uses differentiable local alignment to improve frame-wise embeddings, outperforming existing methods in action recognition tasks.

Contribution

It proposes a novel Local-Alignment Contrastive (LAC) loss with differentiable local alignment, enabling dynamic adjustment of temporal gap penalties for better video representation learning.

Findings

01 Outperforms state-of-the-art on action recognition benchmarks

02 Uses differentiable Smith-Waterman for flexible local alignment

03 Enhances discriminative frame-level embeddings

Abstract

Robust frame-wise embeddings are essential to perform video analysis and understanding tasks. We present a self-supervised method for representation learning based on aligning temporal video sequences. Our framework uses a transformer-based encoder to extract frame-level features and leverages them to find the optimal alignment path between video sequences. We introduce the novel Local-Alignment Contrastive (LAC) loss, which combines a differentiable local alignment loss to capture local temporal dependencies with a contrastive loss to enhance discriminative learning. Prior works on video alignment have focused on using global temporal ordering across sequence pairs, whereas our loss encourages identifying the best-scoring subsequence alignment. LAC uses the differentiable Smith-Waterman (SW) affine method, which features a flexible parameterization learned through the training phase,…
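The differentiable local alignment named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes cosine similarity between frame embeddings, fixed scalar gap penalties (the abstract says the parameterization is learned during training), and a temperature-scaled log-sum-exp as the smooth replacement for max. The names `smooth_max`, `cosine`, and `smooth_sw_affine` are hypothetical and chosen only for this illustration.

```python
import math

NEG_INF = float("-inf")


def smooth_max(values, temperature):
    """Temperature-scaled log-sum-exp: a differentiable surrogate for max."""
    finite = [v for v in values if v != NEG_INF]
    if not finite:
        return NEG_INF
    peak = max(finite)  # subtract the peak for numerical stability
    total = sum(math.exp((v - peak) / temperature) for v in finite)
    return peak + temperature * math.log(total)


def cosine(u, v):
    """Cosine similarity between two frame embeddings (assumed similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def smooth_sw_affine(x, y, gap_open=1.0, gap_extend=0.5, temperature=0.1):
    """Smoothed Smith-Waterman score with affine gaps (three-state recursion).

    x, y: lists of frame embeddings. The gap penalties are fixed here for
    illustration only; a trainable version would treat them as parameters.
    """
    n, m = len(x), len(y)
    match = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends in a match
    gap_x = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends in a gap in y
    gap_y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends in a gap in x
    cells = [0.0]  # the empty alignment scores 0, which makes it local
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sim = cosine(x[i - 1], y[j - 1])
            # Start a new alignment (0) or extend any previous state.
            match[i][j] = sim + smooth_max(
                [0.0, match[i - 1][j - 1], gap_x[i - 1][j - 1], gap_y[i - 1][j - 1]],
                temperature,
            )
            # Opening a gap costs gap_open; extending it costs gap_extend.
            gap_x[i][j] = smooth_max(
                [match[i - 1][j] - gap_open, gap_x[i - 1][j] - gap_extend],
                temperature,
            )
            gap_y[i][j] = smooth_max(
                [match[i][j - 1] - gap_open, gap_y[i][j - 1] - gap_extend],
                temperature,
            )
            cells.append(match[i][j])
    # Smooth maximum over all end positions: the best-scoring subsequence.
    return smooth_max(cells, temperature)
```

Because the final score is a smooth maximum over every end cell rather than the value in the last cell, a short clip embedded in a longer, otherwise unrelated sequence still scores highly. This is the local behavior the abstract contrasts with global temporal ordering. In an autodiff framework the same recursion would propagate gradients to the embeddings and to the gap penalties.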

Code & Models

Repositories

keynekassapa13/LAC
PyTorch · Official

Taxonomy

Topics: Face and Expression Recognition · Speech and Audio Processing · Machine Learning and ELM