HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

Linjie Li; Yen-Chun Chen; Yu Cheng; Zhe Gan; Licheng Yu; Jingjing Liu

arXiv:2005.00200·cs.CV·October 1, 2020·51 cites



TL;DR

HERO is a hierarchical video+language pre-training framework that captures local frame-level context and global video context, adds two new pre-training tasks, and achieves state-of-the-art results on multiple video understanding benchmarks.

Contribution

The paper introduces HERO, a hierarchical encoder trained with new pre-training tasks for video+language understanding, together with two new benchmarks.

Findings

01 Achieves a new state of the art on multiple video understanding benchmarks.

02 Introduces two new challenging benchmarks: How2QA and How2R.

03 Demonstrates the effectiveness of hierarchical encoding and the new pre-training tasks.

Abstract

We present HERO, a novel framework for large-scale video+language omni-representation learning. HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer via multimodal fusion, and global video context is captured by a Temporal Transformer. In addition to standard Masked Language Modeling (MLM) and Masked Frame Modeling (MFM) objectives, we design two new pre-training tasks: (i) Video-Subtitle Matching (VSM), where the model predicts both global and local temporal alignment; and (ii) Frame Order Modeling (FOM), where the model predicts the right order of shuffled video frames. HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions. Comprehensive experiments demonstrate that HERO achieves new state of the art on…
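The Frame Order Modeling objective described in the abstract can be illustrated with a small data-preparation sketch. This is not the authors' implementation: the function name `make_fom_example`, the `shuffle_ratio` default, and the use of `-1` to mark frames that were left in place are all assumptions made for illustration. It only shows how shuffled frame positions and their prediction targets might be constructed.

```python
import random


def make_fom_example(num_frames, shuffle_ratio=0.15, seed=0):
    """Build one hypothetical Frame Order Modeling example.

    Returns (order, targets):
      order[i]   = original index of the frame now placed at position i
      targets[i] = original index the model should predict at position i,
                   or -1 if position i was not selected for shuffling
    """
    rng = random.Random(seed)

    # Select a subset of positions to reorder (at least two, so that a
    # permutation is possible, but never more than the frames available).
    num_selected = min(num_frames, max(2, round(num_frames * shuffle_ratio)))
    selected = sorted(rng.sample(range(num_frames), num_selected))

    # Permute only the selected frames among the selected positions.
    permuted = selected[:]
    rng.shuffle(permuted)

    order = list(range(num_frames))
    targets = [-1] * num_frames
    for position, source in zip(selected, permuted):
        order[position] = source
        targets[position] = source
    return order, targets


if __name__ == "__main__":
    order, targets = make_fom_example(num_frames=20)
    print("frame order:", order)
    print("targets:    ", targets)
```

In this sketch a model would receive the frames in `order` and be trained to recover the original index at each position where `targets` is not `-1`; all other positions would be ignored by the loss.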


Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax