An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, Zicheng Liu

TL;DR
This paper systematically investigates masked visual modeling (MVM) in video-language pre-training using the VIOLET transformer, exploring various reconstructive targets and demonstrating significant improvements across multiple video-language benchmarks.
Contribution
It introduces a comprehensive study of MVM strategies in VidL pre-training and proposes an enhanced model, VIOLETv2, with improved performance on diverse benchmarks.
Findings
VIOLETv2 with MVM outperforms previous models on 13 benchmarks.
Different MVM targets have varying impacts on downstream tasks.
Effective MVM training significantly boosts VidL task performance.
Abstract
Masked visual modeling (MVM) has been recently proven effective for visual pre-training. While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have been explored in video-language (VidL) pre-training, previous studies fail to find a truly effective MVM strategy that can largely benefit the downstream performance. In this work, we systematically examine the potential of MVM in the context of VidL learning. Specifically, we base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), where the supervision from MVM training can be backpropagated to the video pixel space. In total, eight different reconstructive targets of MVM are explored, from low-level pixel values and oriented gradients to high-level depth maps, optical flow, discrete visual tokens, and latent visual features. We conduct comprehensive experiments and provide insights into the…
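To make the idea of a reconstructive target concrete, the following is a minimal sketch of the simplest case named above, regressing low-level pixel values for masked patches. It is not the authors' implementation. The helper names `sample_mask` and `masked_l1_loss`, the 50% mask ratio, the L1 loss, and the toy data are all assumptions chosen for illustration. The point it shows is that the loss is computed only over the masked patches.

```python
import random

def sample_mask(num_patches, mask_ratio, seed=0):
    """Pick which patch indices to mask (hypothetical helper, not from the paper)."""
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), num_masked))

def masked_l1_loss(predicted, target, masked_indices):
    """Mean absolute error over masked patches only; unmasked patches are ignored."""
    if not masked_indices:
        return 0.0
    total, count = 0.0, 0
    for i in masked_indices:
        for p, t in zip(predicted[i], target[i]):
            total += abs(p - t)
            count += 1
    return total / count

# Toy data (assumed): 4 patches, each with 3 target values such as pixel intensities.
target = [[0.0, 0.5, 1.0], [0.2, 0.2, 0.2], [1.0, 1.0, 1.0], [0.3, 0.6, 0.9]]
# Stand-in for model output: only patch 1 is reconstructed incorrectly.
predicted = [[0.0, 0.5, 1.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.3, 0.6, 0.9]]

masked = sample_mask(num_patches=4, mask_ratio=0.5)  # assumed mask ratio
loss = masked_l1_loss(predicted, target, masked)
print("masked patches:", masked, "loss:", loss)
```

Swapping the contents of `target` for another quantity (for example depth values or latent features) would change the reconstructive target without changing the masking logic, which is the axis of variation the study explores.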
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Absolute Position Encodings · Residual Connection · Position-Wise Feed-Forward Layer · Adam · Dense Connections
