VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling

Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, Zicheng Liu

arXiv:2111.12681 · cs.CV · April 19, 2022 · 89 cites



TL;DR

VIOLET introduces an end-to-end video-language transformer with explicit temporal modeling and a novel masked visual-token pre-training task, significantly improving performance on video question answering and text-to-video retrieval benchmarks.

Contribution

The paper proposes VIOLET, a fully end-to-end video-language transformer with explicit temporal modeling and a new pre-training task, Masked Visual-token Modeling, enhancing video understanding capabilities.

Findings

1. Achieves state-of-the-art results on 5 video question answering tasks.
2. Achieves state-of-the-art results on 4 text-to-video retrieval tasks.
3. Demonstrates the effectiveness of explicit temporal modeling and MVM pre-training.

Abstract

A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data. Recent studies try to mitigate this disconnection via end-to-end training. To make it computationally feasible, prior works tend to "imagify" video inputs, i.e., a handful of sparsely sampled frames are fed into a 2D CNN, followed by a simple mean-pooling or concatenation to obtain the overall video representations. Although achieving promising results, such simple approaches may lose temporal information that is essential for performing downstream VidL tasks. In this work, we present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs. Further, unlike previous studies that found pre-training tasks on video…
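The "imagify" baseline described in the abstract can be illustrated with a minimal sketch. It is not taken from the VIOLET code base: the function names `sample_sparse_indices` and `mean_pool` are hypothetical, and the toy feature vectors stand in for the per-frame outputs of a 2D CNN. The sketch shows why mean-pooling may lose temporal information: the pooled vector is the same whether the frames are played forward or backward.

```python
def sample_sparse_indices(num_frames, num_samples):
    """Pick `num_samples` frame indices spread evenly over a clip.

    Hypothetical helper: takes the midpoint of each equal-length segment.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    segment = num_frames / num_samples
    return [min(num_frames - 1, int(segment * (i + 0.5))) for i in range(num_samples)]


def mean_pool(frame_features):
    """Average per-frame feature vectors into one video-level vector."""
    count = len(frame_features)
    dim = len(frame_features[0])
    return [sum(frame[d] for frame in frame_features) / count for d in range(dim)]


# Sparse sampling: 4 frames out of a 32-frame clip.
indices = sample_sparse_indices(32, 4)

# Toy stand-ins for 2D CNN features of the 4 sampled frames (assumed values).
forward = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
backward = forward[::-1]  # same frames, reversed temporal order

pooled_forward = mean_pool(forward)
pooled_backward = mean_pool(backward)

# Mean-pooling ignores frame order, so both clips map to the same vector.
print(indices)
print(pooled_forward == pooled_backward)
```

A model that attends over the frame sequence with position information, as a video transformer does, can in principle tell the two orderings apart, which is the motivation the abstract gives for explicit temporal modeling.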

Code & Models

Repositories

tsujuifu/pytorch_violet (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dense Connections · Softmax · Residual Connection · Adam