UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Huaishao Luo; Lei Ji; Botian Shi; Haoyang Huang; Nan Duan; Tianrui Li; Jason Li; Taroon Bharti; Ming Zhou

arXiv:2002.06353·cs.CV·September 16, 2020·169 cites

TL;DR

UniVL is a comprehensive pre-training model that unifies understanding and generation tasks in video-language processing, achieving state-of-the-art results across multiple benchmarks.

Contribution

The paper introduces UniVL, a novel unified pre-training framework with multiple objectives and strategies, enabling effective multimodal understanding and generation.

Findings

01 Achieves state-of-the-art results on five downstream tasks.

02 Effectively learns strong video-text representations.

03 Demonstrates the benefit of unified pre-training for multimodal tasks.

Abstract

With the recent success of the pre-training technique for NLP and image-linguistic tasks, some video-linguistic pre-training works are gradually developed to improve video-text related downstream tasks. However, most of the existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy for generation tasks. This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation. It comprises four components, including two single-modal encoders, a cross encoder, and a decoder with the Transformer backbone. Five objectives, including video-text joint, conditioned masked language model (CMLM), conditioned masked frame model (CMFM), video-text alignment, and language reconstruction, are designed to train each of the components. We further develop two pre-training strategies, stage by stage…
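As a rough illustration of the masking step behind a masked language model objective such as the CMLM named in the abstract, the sketch below corrupts a token sequence and records the original tokens as prediction targets. It covers only the text-side input corruption; the video conditioning, the encoders, and the loss are not shown. The function name `mask_tokens`, the `[MASK]` symbol, and the default masking probability of 0.15 are assumptions chosen for illustration, not details given on this page.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder symbol, not taken from this page


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for a masked-language-model objective.

    labels[i] holds the original token where position i was masked,
    and None where the token was left unchanged.
    mask_prob is an assumed default, not a value stated in the abstract.
    """
    rng = random.Random(seed)  # local RNG so a seed makes the result repeatable
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # hide the token from the model
            labels.append(tok)         # the model would be trained to recover it
        else:
            masked.append(tok)
            labels.append(None)        # no prediction target at this position
    return masked, labels


if __name__ == "__main__":
    caption = ["add", "the", "onions", "to", "the", "pan"]
    masked, labels = mask_tokens(caption, mask_prob=0.5, seed=0)
    print(masked)
    print(labels)
```

In a conditioned setting, a model would predict the masked tokens from the remaining text together with the video features, so this function would sit in the data pipeline ahead of the encoders. The sketch uses a single replacement symbol for simplicity; other corruption schemes are possible.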

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Byte Pair Encoding · Dense Connections