Temporal Alignment Networks for Long-term Video
Tengda Han, Weidi Xie, Andrew Zisserman

TL;DR
This paper introduces a temporal alignment network for long-term videos and text, using a novel co-training method to handle noisy data, achieving state-of-the-art results in alignment and downstream tasks.
Contribution
The paper proposes a new co-training approach for training alignment networks on noisy, weakly aligned instructional videos without manual labels.
Findings
Outperforms strong baselines like CLIP and MIL-NCE on alignment tasks.
Achieves state-of-the-art results in text-video retrieval and action segmentation.
Improves downstream action recognition through end-to-end finetuning.
Abstract
The objective of this paper is a temporal alignment network that ingests long-term video sequences, and associated text sentences, in order to: (1) determine if a sentence is alignable with the video; and (2) if it is alignable, then determine its alignment. The challenge is to train such networks from large-scale datasets, such as HowTo100M, where the associated text sentences have significant noise, and are only weakly aligned when relevant. Apart from proposing the alignment network, we also make four contributions: (i) we describe a novel co-training method that enables denoising and training on raw instructional videos without using manual annotation, despite the considerable noise; (ii) to benchmark the alignment performance, we manually curate a 10-hour subset of HowTo100M, totalling 80 videos, with sparse temporal descriptions. Our proposed model, trained on HowTo100M, outperforms…
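The two-step objective in the abstract (first decide whether a sentence is alignable, then locate it in time) can be illustrated with a toy sketch. This is not the paper's network: it assumes the sentence and each video frame have already been embedded as plain vectors, uses cosine similarity as a stand-in for a learned score, and the function name `align_sentence` and the `threshold` value are hypothetical choices made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def align_sentence(sentence_vec, frame_vecs, threshold=0.5):
    """Toy alignment: returns (alignable, best_frame_index, best_score).

    Assumption: a sentence counts as alignable when its best frame
    similarity reaches `threshold`; the paper learns this decision instead.
    """
    if not frame_vecs:
        return False, None, 0.0
    scores = [cosine(sentence_vec, f) for f in frame_vecs]
    # Step 2 candidate: the frame with the highest similarity.
    best_index = max(range(len(scores)), key=scores.__getitem__)
    best_score = scores[best_index]
    # Step 1: reject sentences that match no frame well enough.
    alignable = best_score >= threshold
    return alignable, (best_index if alignable else None), best_score

# Hypothetical 2-D embeddings for three frames.
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(align_sentence([0.0, 2.0], frames))    # matches the second frame
print(align_sentence([-1.0, -1.0], frames))  # matches nothing, so not alignable
```

Separating the alignability decision from the timestamp choice mirrors the abstract's framing: a noisy sentence with no visual counterpart is rejected rather than forced onto its least-bad frame.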
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization
