An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling

Tsu-Jui Fu; Linjie Li; Zhe Gan; Kevin Lin; William Yang Wang; Lijuan Wang; Zicheng Liu

arXiv:2209.01540·cs.CV·June 2, 2023·6 cites

Open Access · 1 Repo

TL;DR

This paper systematically investigates masked visual modeling (MVM) in video-language pre-training using the VIOLET transformer, exploring various reconstructive targets and demonstrating significant improvements across multiple video-language benchmarks.

Contribution

The paper presents a comprehensive study of MVM strategies in VidL pre-training and proposes an enhanced model, VIOLETv2, with improved performance on diverse benchmarks.

Findings

1. VIOLETv2 with MVM outperforms previous models on 13 benchmarks.
2. Different MVM targets have varying impacts on downstream tasks.
3. Effective MVM training significantly boosts VidL task performance.

Abstract

Masked visual modeling (MVM) has been recently proven effective for visual pre-training. While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have been explored in video-language (VidL) pre-training, previous studies fail to find a truly effective MVM strategy that can largely benefit the downstream performance. In this work, we systematically examine the potential of MVM in the context of VidL learning. Specifically, we base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), where the supervision from MVM training can be backpropagated to the video pixel space. In total, eight different reconstructive targets of MVM are explored, from low-level pixel values and oriented gradients to high-level depth maps, optical flow, discrete visual tokens, and latent visual features. We conduct comprehensive experiments and provide insights into the…
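To make the idea of a reconstructive MVM objective concrete, here is a minimal sketch of the simplest target mentioned in the abstract, low-level pixel values. It is an illustration under stated assumptions, not the paper's implementation: the function names `random_patch_mask` and `masked_l1_loss`, the 50% mask ratio, the L1 loss, and the constant stand-in for model predictions are all hypothetical choices made for this example.

```python
import random


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a list of booleans; True marks a patch that is hidden from the model."""
    num_masked = int(round(num_patches * mask_ratio))
    masked_idx = set(rng.sample(range(num_patches), num_masked))
    return [i in masked_idx for i in range(num_patches)]


def masked_l1_loss(predicted, target, mask):
    """Mean absolute error computed over masked patches only.

    Unmasked patches are skipped, so the model is scored only on
    content it could not see directly.
    """
    total, count = 0.0, 0
    for pred_patch, tgt_patch, is_masked in zip(predicted, target, mask):
        if not is_masked:
            continue
        for p, t in zip(pred_patch, tgt_patch):
            total += abs(p - t)
            count += 1
    return total / count if count else 0.0


# Toy data: 8 flattened patches of 4 pixel values each (assumed sizes).
rng = random.Random(0)
num_patches, patch_dim = 8, 4
target = [[rng.random() for _ in range(patch_dim)] for _ in range(num_patches)]

# Assumed mask ratio of 0.5; the abstract does not state one.
mask = random_patch_mask(num_patches, mask_ratio=0.5, rng=rng)

# Stand-in for model output: a constant guess for every pixel.
predicted = [[0.5] * patch_dim for _ in range(num_patches)]

loss = masked_l1_loss(predicted, target, mask)
print(f"masked patches: {sum(mask)}, loss: {loss:.4f}")
```

Swapping the reconstructive target (for example, depth maps or discrete visual tokens instead of pixels) would change what `target` holds and possibly the loss function, while the masking step stays the same. In an end-to-end model, the gradient of this loss would flow back through the video encoder to the pixel inputs, which is the property the abstract highlights.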

Code & Models

Repositories

tsujuifu/pytorch_empirical-mvm (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Absolute Position Encodings · Residual Connection · Position-Wise Feed-Forward Layer · Adam · Dense Connections