Video-Text Representation Learning via Differentiable Weak Temporal Alignment
Dohwan Ko, Joonmyung Choi, Juyeon Ko, Shinyeong Noh, Kyoung-Woon On, Eun-Sol Kim, Hyunwoo J. Kim

TL;DR
This paper introduces VT-TWINS, a novel self-supervised learning framework that uses a differentiable Dynamic Time Warping approach to effectively learn joint video-text representations from noisy, weakly aligned data, improving performance on downstream tasks.
Contribution
It proposes a differentiable DTW method for better handling weakly correlated data in multi-modal learning, advancing self-supervised video-text embedding techniques.
Findings
Significant improvements in multi-modal representation learning.
Outperforms existing methods on downstream tasks.
Effective handling of noisy, weakly aligned video-text data.
Abstract
Learning generic joint representations for video and text by a supervised method requires a prohibitively substantial amount of manually annotated video datasets. As a practical alternative, a large-scale but uncurated and narrated video dataset, HowTo100M, has recently been introduced. But it is still challenging to learn joint embeddings of video and text in a self-supervised manner, due to its ambiguity and non-sequential alignment. In this paper, we propose a novel multi-modal self-supervised framework Video-Text Temporally Weak Alignment-based Contrastive Learning (VT-TWINS) to capture significant information from noisy and weakly correlated data using a variant of Dynamic Time Warping (DTW). We observe that the standard DTW inherently cannot handle weakly correlated data and only considers the globally optimal alignment path. To address these problems, we develop a differentiable…
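The abstract contrasts standard DTW, which keeps only the single globally optimal alignment path, with a differentiable variant. As a rough illustration of that idea, the sketch below replaces the hard minimum in the DTW recursion with a smooth soft-minimum, so every alignment path contributes to the value and gradients can flow through it. This is the generic soft-DTW relaxation, not the authors' VT-TWINS implementation; the function names, the smoothing parameter `gamma`, and the toy embeddings are all assumptions made for the example.

```python
import numpy as np

def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    values = np.asarray(values, dtype=float)
    m = values.min()
    return m - gamma * np.log(np.sum(np.exp(-(values - m) / gamma)))

def soft_dtw(cost, gamma=0.1):
    """Soft-DTW value for a pairwise cost matrix of shape (n_clips, n_captions).

    As gamma approaches 0 this approaches the standard (hard) DTW cost.
    """
    n, m = cost.shape
    # R[i, j] holds the soft-minimal accumulated cost of aligning the
    # first i clips with the first j captions.
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i, j] = cost[i - 1, j - 1] + soft_min(
                [R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]], gamma
            )
    return R[n, m]

# Toy, hypothetical unit-norm embeddings: two video clips and two captions.
video = np.array([[1.0, 0.0], [0.0, 1.0]])
text = np.array([[1.0, 0.0], [0.0, 1.0]])
cost = 1.0 - video @ text.T  # cosine distance between every clip/caption pair

sharp = soft_dtw(cost, gamma=0.01)   # close to hard DTW
smooth = soft_dtw(cost, gamma=1.0)   # blends in non-optimal paths
```

A larger `gamma` gives non-optimal paths more weight, which is one way to avoid committing to a single alignment when the narration only loosely matches the video. The soft-minimum is never larger than the hard minimum, so the soft-DTW value is a lower bound on the standard DTW cost.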
Taxonomy
Topics: Video Analysis and Summarization · Time Series Analysis and Forecasting · Music and Audio Processing
Methods: Contrastive Learning · Dynamic Time Warping
