# Large-scale weakly-supervised pre-training for video action recognition

**Authors:** Deepti Ghadiyaram, Matt Feiszli, Du Tran, Xueting Yan, Heng Wang, Dhruv Mahajan

arXiv: 1905.00561 · 2019-05-03

## TL;DR

This paper demonstrates that large-scale weakly-supervised pre-training on over 65 million web videos significantly advances video action recognition, addressing challenges of noisy labels and dataset construction.

## Contribution

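The paper provides an empirical study of large-scale weakly-supervised pre-training for video action recognition. It examines how three dataset-construction choices affect transfer learning: the design of the verb-object label space, pre-training for image features versus spatio-temporal features, and clip selection under a fixed video budget.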

## Key findings

- Pre-training on over 65 million noisy, hashtag-labeled social-media videos substantially improves the state of the art on three public action recognition datasets.
- Constructing verb-object label spaces enhances transfer learning.
- Pre-training for spatio-temporal features benefits action recognition.

## Abstract

Current fully-supervised video datasets consist of only a few hundred thousand videos and fewer than a thousand domain-specific labels. This hinders the progress towards advanced video architectures. This paper presents an in-depth study of using large volumes of web videos for pre-training video models for the task of action recognition. Our primary empirical finding is that pre-training at a very large scale (over 65 million videos), despite using noisy social-media videos and hashtags, substantially improves the state-of-the-art on three challenging public action recognition datasets. Further, we examine three questions in the construction of weakly-supervised video action datasets. First, given that actions involve interactions with objects, how should one construct a verb-object pre-training label space to benefit transfer learning the most? Second, frame-based models perform quite well on action recognition; is pre-training for good image features sufficient or is pre-training for spatio-temporal features valuable for optimal transfer learning? Finally, actions are generally less well-localized in long videos vs. short videos; since action labels are provided at a video level, how should one choose video clips for best performance, given some fixed budget of number or minutes of videos?

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.00561/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/1905.00561/full.md

## References

73 references — full list in the complete paper: https://tomesphere.com/paper/1905.00561/full.md

---
Source: https://tomesphere.com/paper/1905.00561