# HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

**Authors:** Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic

arXiv: 1906.03327 · 2019-08-01

## TL;DR

This paper introduces HowTo100M, a large-scale dataset of narrated instructional videos collected without manual annotation, and uses it to learn a text-video embedding that achieves state-of-the-art results in text-to-video retrieval and action localization.

## Contribution

The work makes three contributions: a scalable method for learning text-video embeddings from automatically transcribed narrations, a new dataset of 136 million clips drawn from 1.22M narrated instructional videos, and evidence that the learned embedding transfers well to other video domains.
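
To make the data-collection idea concrete, here is a minimal sketch of how timestamped narration segments might be turned into clip-caption pairs. The `Segment` and `ClipCaptionPair` types, the `make_pairs` helper, the filtering rules, and the sample transcript are all hypothetical illustrations, not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """One line of an automatically generated transcript (hypothetical format)."""
    start: float  # seconds
    end: float    # seconds
    text: str


@dataclass(frozen=True)
class ClipCaptionPair:
    video_id: str
    start: float
    end: float
    caption: str


def make_pairs(video_id: str, segments: List[Segment]) -> List[ClipCaptionPair]:
    """Treat each non-empty narration segment as the caption of the clip it spans."""
    pairs = []
    for seg in segments:
        caption = seg.text.strip()
        # Assumed filter: skip empty narrations and zero-length segments.
        if caption and seg.end > seg.start:
            pairs.append(ClipCaptionPair(video_id, seg.start, seg.end, caption))
    return pairs


transcript = [
    Segment(0.0, 3.5, "first crack two eggs into the bowl"),
    Segment(3.5, 3.5, "uh"),    # zero-length segment, skipped
    Segment(4.0, 8.0, "   "),   # empty narration, skipped
    Segment(8.0, 12.0, "whisk until smooth"),
]
pairs = make_pairs("video_0001", transcript)
```

No human labels the pairs in this scheme, which is what lets the approach scale. With the figures given in the abstract, 136 million clips from 1.22M videos works out to roughly 111 clips per video on average.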

## Key findings

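- State-of-the-art text-to-video retrieval on instructional video datasets such as YouCook2
- State-of-the-art action localization on instructional video datasets such as CrossTask
- Transfer to other domains: fine-tuning on MSR-VTT (generic YouTube videos) and LSMDC (movies) outperforms models trained on those datasets alone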
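
Text-to-video retrieval is commonly scored with recall at K: the fraction of text queries whose matching video appears among the top K results. The sketch below assumes that metric and a toy similarity matrix. The function name `recall_at_k` and the numbers are illustrative, and this summary does not state which metrics the paper reports.

```python
from typing import List


def recall_at_k(sim: List[List[float]], k: int) -> float:
    """sim[i][j] is the similarity of query i to video j; video i matches query i."""
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank: number of videos scored strictly higher than the match.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)


toy_sim = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked first
    [0.8, 0.5, 0.2],  # query 1: correct video ranked second
    [0.7, 0.6, 0.1],  # query 2: correct video ranked third
]
r1 = recall_at_k(toy_sim, 1)  # one of three queries succeeds
r2 = recall_at_k(toy_sim, 2)  # two of three queries succeed
```

A higher recall at small K means the correct clip tends to sit near the top of the ranked list, which is the behavior a good text-video embedding should produce.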

## Abstract

Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic Youtube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at: www.di.ens.fr/willow/research/howto100m/.
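
A joint text-video embedding maps clips and captions into one vector space where matching pairs score higher than mismatched ones. The sketch below assumes cosine similarity and a max-margin ranking loss with margin `delta`. These choices, the helper names, and the toy vectors are assumptions for illustration, since the abstract does not specify the model or the training objective.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def ranking_loss(videos: List[Vector], texts: List[Vector], delta: float = 0.1) -> float:
    """Max-margin ranking loss over a batch where videos[i] matches texts[i]."""
    n = len(videos)
    total = 0.0
    for i in range(n):
        pos = cosine(videos[i], texts[i])
        for j in range(n):
            if j == i:
                continue
            # Penalize a mismatched caption scoring within delta of the true pair.
            total += max(0.0, delta + cosine(videos[i], texts[j]) - pos)
            # Penalize a mismatched clip scoring within delta of the true pair.
            total += max(0.0, delta + cosine(videos[j], texts[i]) - pos)
    return total / n


# Toy, already-embedded vectors (assumed, 2-dimensional for readability).
video_vecs = [[1.0, 0.0], [0.0, 1.0]]
aligned = ranking_loss(video_vecs, [[1.0, 0.0], [0.0, 1.0]])
misaligned = ranking_loss(video_vecs, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is zero when every matching pair beats every mismatched pair by at least `delta`, and grows as captions drift toward the wrong clips. Narration-derived captions are noisy, so in practice some "matching" pairs will not actually describe what is on screen.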

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1906.03327/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/1906.03327/full.md

## References

68 references — full list in the complete paper: https://tomesphere.com/paper/1906.03327/full.md

---
Source: https://tomesphere.com/paper/1906.03327