Temporal Alignment Networks for Long-term Video

Tengda Han; Weidi Xie; Andrew Zisserman

arXiv:2204.02968·cs.CV·April 7, 2022

TL;DR

This paper introduces a temporal alignment network for long-term videos and text, using a novel co-training method to handle noisy data, achieving state-of-the-art results in alignment and downstream tasks.

Contribution

The paper proposes a new co-training approach for training alignment networks on noisy, weakly aligned instructional videos without manual labels.

Findings

01. Outperforms strong baselines like CLIP and MIL-NCE on alignment tasks.

02. Achieves state-of-the-art results in text-video retrieval and action segmentation.

03. Improves downstream action recognition through end-to-end finetuning.

Abstract

The objective of this paper is a temporal alignment network that ingests long-term video sequences and associated text sentences in order to: (1) determine whether a sentence is alignable with the video; and (2) if it is alignable, determine its alignment. The challenge is to train such networks from large-scale datasets, such as HowTo100M, where the associated text sentences have significant noise and are only weakly aligned when relevant. Apart from proposing the alignment network, we also make four contributions: (i) we describe a novel co-training method that enables denoising and training on raw instructional videos without using manual annotation, despite the considerable noise; (ii) to benchmark the alignment performance, we manually curate a 10-hour subset of HowTo100M, totalling 80 videos, with sparse temporal descriptions. Our proposed model, trained on HowTo100M, outperforms…
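The two-part objective in the abstract (decide whether a sentence is alignable, then locate it in time) can be illustrated with a minimal sketch. This is not the paper's network: it assumes sentence and frame embeddings are already available as plain vectors, scores them with cosine similarity, and uses a fixed threshold for the alignability decision. The function names `cosine` and `align_sentences`, the `threshold` parameter, and the toy embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def align_sentences(sentence_embs, frame_embs, threshold=0.5):
    """For each sentence, return (alignable, frame_index_or_None, best_score).

    Assumption: a sentence counts as alignable when its best-matching
    frame reaches `threshold`. The threshold rule is illustrative only.
    """
    results = []
    for s in sentence_embs:
        sims = [cosine(s, f) for f in frame_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        alignable = sims[best] >= threshold
        results.append((alignable, best if alignable else None, sims[best]))
    return results

# Toy data: three frame embeddings and two sentence embeddings (hypothetical).
frames = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
sentences = [[0.0, 2.0], [-1.0, -1.0]]
results = align_sentences(sentences, frames, threshold=0.8)
for alignable, index, score in results:
    print(alignable, index, round(score, 3))
```

In this toy run the first sentence matches the second frame exactly and is marked alignable, while the second sentence's best score falls below the threshold and is treated as not alignable, so no timestamp is returned for it.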

Code & Models

Repositories

tengdahan/temporalalignnet
pytorch

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization