TL;DR
This paper presents a weakly supervised approach to learning video representations. It aligns temporal sequences with a probabilistic formulation of dynamic time warping (DTW) that enforces cycle consistency, and evaluates the learned embeddings on tasks such as fine-grained action classification and few-shot learning.
Contribution
It introduces a probabilistic, contrastive, and differentiable dynamic time warping loss with cycle consistency for temporal sequence alignment in representation learning.
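To make the idea of a differentiable DTW concrete, here is a minimal NumPy sketch of the standard DTW recursion with the hard minimum replaced by a smooth minimum. This is a generic smoothed DTW, not the paper's exact probabilistic, contrastive loss; the function names `soft_min` and `soft_dtw_cost` and the parameter `gamma` are assumptions chosen for illustration.

```python
import numpy as np

def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    v = np.asarray(values, dtype=float)
    m = v.min()  # shift by the true minimum to avoid overflow
    return m - gamma * np.log(np.exp(-(v - m) / gamma).sum())

def soft_dtw_cost(D, gamma=0.1):
    """Smoothed DTW alignment cost for a pairwise distance matrix D (n x m).

    Illustrative only: as gamma approaches 0 this approaches classic DTW.
    """
    n, m = D.shape
    R = np.full((n + 1, m + 1), np.inf)  # accumulated cost table
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # The three allowed predecessor moves: down, right, diagonal.
            prev = [R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]]
            R[i, j] = D[i - 1, j - 1] + soft_min(prev, gamma)
    return R[n, m]

# Toy usage: two 2-step sequences whose matching frames have distance 0.
D_toy = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
toy_cost = soft_dtw_cost(D_toy, gamma=0.01)
```

Because the smooth minimum is differentiable in every entry of `D`, the same recursion written in an autodiff framework could pass gradients back to an embedding network that produces the distances.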
Findings
Significant improvements in action classification accuracy.
Enhanced performance in few-shot learning scenarios.
Effective video synchronization and 3D pose reconstruction.
Abstract
We introduce a weakly supervised method for representation learning based on aligning temporal sequences (e.g., videos) of the same process (e.g., human action). The main idea is to use the global temporal ordering of latent correspondences across sequence pairs as a supervisory signal. In particular, we propose a loss based on scoring the optimal sequence alignment to train an embedding network. Our loss is based on a novel probabilistic path finding view of dynamic time warping (DTW) that contains the following three key features: (i) the local path routing decisions are contrastive and differentiable, (ii) pairwise distances are cast as probabilities that are contrastive as well, and (iii) our formulation naturally admits a global cycle consistency loss that verifies correspondences. For evaluation, we consider the tasks of fine-grained action classification, few-shot learning, and…
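Features (ii) and (iii) can be illustrated with a small NumPy sketch. It casts pairwise distances as probabilities with a row-wise softmax, then measures how far a round trip (sequence X to Y and back to X) is from the identity. This is one simple way to express these ideas, not the paper's formulation; the helper names, the `temperature` parameter, and the round-trip error are all assumptions made for the example.

```python
import numpy as np

def pairwise_sq_dists(X, Y):
    """Squared Euclidean distances between rows of X (n x d) and Y (m x d)."""
    diff = X[:, None, :] - Y[None, :, :]
    return (diff ** 2).sum(axis=-1)

def match_probabilities(D, temperature=1.0):
    """Row-wise softmax over negative distances.

    Each row is a distribution over candidate matches, so raising the
    probability of one match necessarily lowers the others (contrastive).
    """
    logits = -D / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def cycle_consistency_error(P_xy, P_yx):
    """Mean absolute deviation of the X -> Y -> X round trip from identity."""
    round_trip = P_xy @ P_yx
    n = round_trip.shape[0]
    return np.abs(round_trip - np.eye(n)).mean()

# Toy usage: two identical, well-separated embedding sequences.
X_toy = np.array([[0.0, 0.0], [10.0, 0.0], [20.0, 0.0]])
Y_toy = X_toy.copy()
P_xy = match_probabilities(pairwise_sq_dists(X_toy, Y_toy))
P_yx = match_probabilities(pairwise_sq_dists(Y_toy, X_toy))
toy_error = cycle_consistency_error(P_xy, P_yx)
```

For well-separated, correctly ordered frames the round trip is close to the identity and the error is near zero; ambiguous or mismatched correspondences would raise it.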
Taxonomy
Methods: Cycle Consistency Loss
