Learning by Aligning Videos in Time
Sanjay Haresh, Sateesh Kumar, Huseyin Coskun, Shahram Najam Syed, Andrey Konin, Muhammad Zeeshan Zia, Quoc-Huy Tran

TL;DR
This paper introduces a self-supervised video representation learning method that aligns videos in time by combining a temporal alignment loss with a temporal regularization term, improving performance on various action understanding tasks, especially when labeled data is limited.
Contribution
The paper proposes a new self-supervised approach combining Soft-DTW and Contrastive-IDM for temporal video alignment, addressing trivial solutions and enhancing video representation learning.
Findings
Outperforms state-of-the-art self-supervised methods on multiple datasets.
Improves action phase classification and retrieval tasks.
Provides significant gains with limited labeled data.
Abstract
We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task, while exploiting both frame-level and video-level information. We leverage a novel combination of temporal alignment loss and temporal regularization terms, which can be used as supervision signals for training an encoder network. Specifically, the temporal alignment loss (i.e., Soft-DTW) aims for the minimum cost for temporally aligning videos in the embedding space. However, optimizing solely for this term leads to trivial solutions, particularly, one where all frames get mapped to a small cluster in the embedding space. To overcome this problem, we propose a temporal regularization term (i.e., Contrastive-IDM) which encourages different frames to be mapped to different points in the embedding space. Extensive evaluations on various tasks, including action phase…
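The alignment term and the collapse problem described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a plain-Python version of the standard Soft-DTW recursion, assuming squared Euclidean distance between frame embeddings and a smoothing parameter `gamma`. The helper names `soft_min`, `sq_dist`, and `soft_dtw` are hypothetical and chosen only for this example.

```python
import math

def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma)))."""
    m = min(values)
    # Subtract the minimum before exponentiating for numerical stability.
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))

def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings (an assumed cost)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def soft_dtw(X, Y, gamma=0.1):
    """Soft-DTW alignment cost between two sequences of frame embeddings."""
    n, m = len(X), len(Y)
    inf = float("inf")
    # R[i][j] holds the soft alignment cost of the prefixes X[:i] and Y[:j].
    R = [[inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = sq_dist(X[i - 1], Y[j - 1])
            R[i][j] = cost + soft_min(
                [R[i - 1][j], R[i][j - 1], R[i - 1][j - 1]], gamma
            )
    return R[n][m]

# Two toy "videos" with distinct per-frame embeddings.
video_a = [[0.0], [1.0], [2.0]]
video_b = [[0.0], [1.5], [2.0]]

# The same videos after a degenerate encoder maps every frame to one point.
collapsed_a = [[0.0], [0.0], [0.0]]
collapsed_b = [[0.0], [0.0], [0.0]]

distinct_cost = soft_dtw(video_a, video_b)
collapsed_cost = soft_dtw(collapsed_a, collapsed_b)
print(distinct_cost, collapsed_cost)
```

In this toy setting the collapsed embeddings reach a lower alignment cost than the informative ones, which is the trivial solution the abstract refers to and the reason an additional regularization term is needed.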
