Video-Text Representation Learning via Differentiable Weak Temporal Alignment

Dohwan Ko; Joonmyung Choi; Juyeon Ko; Shinyeong Noh; Kyoung-Woon On; Eun-Sol Kim; Hyunwoo J. Kim

arXiv:2203.16784·cs.CV·April 1, 2022·1 cites

Open Access · 1 Repo

TL;DR

This paper introduces VT-TWINS, a novel self-supervised learning framework that uses a differentiable Dynamic Time Warping approach to effectively learn joint video-text representations from noisy, weakly aligned data, improving performance on downstream tasks.

Contribution

It proposes a differentiable DTW method for better handling weakly correlated data in multi-modal learning, advancing self-supervised video-text embedding techniques.

Findings

01 Significant improvements in multi-modal representation learning.

02 Outperforms existing methods on downstream tasks.

03 Effective handling of noisy, weakly aligned video-text data.

Abstract

Learning generic joint representations for video and text by a supervised method requires a prohibitively substantial amount of manually annotated video datasets. As a practical alternative, a large-scale but uncurated and narrated video dataset, HowTo100M, has recently been introduced. But it is still challenging to learn joint embeddings of video and text in a self-supervised manner, due to its ambiguity and non-sequential alignment. In this paper, we propose a novel multi-modal self-supervised framework Video-Text Temporally Weak Alignment-based Contrastive Learning (VT-TWINS) to capture significant information from noisy and weakly correlated data using a variant of Dynamic Time Warping (DTW). We observe that the standard DTW inherently cannot handle weakly correlated data and only considers the globally optimal alignment path. To address these problems, we develop a differentiable…
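To make the idea of a differentiable DTW concrete, the following is a minimal sketch of the generic soft-min relaxation of DTW (in the style of soft-DTW). It is not the variant proposed in the paper, whose details are cut off in the abstract above. The names `soft_min` and `dtw`, the smoothing parameter `gamma`, and the toy cost matrix are all assumptions made for illustration.

```python
import math


def soft_min(values, gamma):
    """Smooth minimum; gamma == 0 falls back to the ordinary hard minimum."""
    if gamma == 0.0:
        return min(values)
    lo = min(values)
    # Subtract the minimum before exponentiating for numerical stability.
    total = sum(math.exp(-(v - lo) / gamma) for v in values)
    return lo - gamma * math.log(total)


def dtw(cost, gamma=0.0):
    """Accumulated alignment cost over a pairwise cost matrix.

    cost[i][j] is the (hypothetical) distance between clip i and caption j.
    gamma == 0 gives standard DTW; gamma > 0 gives a smooth relaxation whose
    value depends on all alignment paths rather than only the best one.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # R[i][j] holds the accumulated cost of aligning the first i clips
    # with the first j captions.
    R = [[inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = [R[i - 1][j - 1], R[i - 1][j], R[i][j - 1]]
            R[i][j] = cost[i - 1][j - 1] + soft_min(prev, gamma)
    return R[n][m]


# Toy cost matrix: |x_i - y_j| for x = [0, 1, 2] and y = [0, 2].
toy_cost = [[0.0, 2.0], [1.0, 1.0], [2.0, 0.0]]
hard_value = dtw(toy_cost)             # standard DTW
soft_value = dtw(toy_cost, gamma=1.0)  # smooth relaxation
```

With `gamma = 0` the recursion keeps only the single cheapest predecessor at each cell, which is the "globally optimal alignment path" behavior the abstract describes. With `gamma > 0` every predecessor contributes, so the result is smooth in the cost entries and can serve as a training loss in an autodiff framework.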


Code & Models

Repositories

mlvlab/vt-twins (PyTorch, official)


Taxonomy

Topics: Video Analysis and Summarization · Time Series Analysis and Forecasting · Music and Audio Processing

Methods: Contrastive Learning · Dynamic Time Warping