VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling
Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, Zicheng Liu

TL;DR
VIOLET introduces an end-to-end video-language transformer with explicit temporal modeling and a novel masked visual-token pre-training task, significantly improving performance on video question answering and text-to-video retrieval benchmarks.
Contribution
The paper proposes VIOLET, a fully end-to-end video-language transformer with explicit temporal modeling, together with a new pre-training task, Masked Visual-token Modeling (MVM), to improve video understanding.
Findings
Achieves state-of-the-art results on 5 video question answering tasks.
Sets a new state of the art on 4 text-to-video retrieval tasks.
Demonstrates the effectiveness of explicit temporal modeling and MVM pre-training.
Abstract
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data. Recent studies try to mitigate this disconnection via end-to-end training. To make it computationally feasible, prior works tend to "imagify" video inputs, i.e., a handful of sparsely sampled frames are fed into a 2D CNN, followed by a simple mean-pooling or concatenation to obtain the overall video representations. Although achieving promising results, such simple approaches may lose temporal information that is essential for performing downstream VidL tasks. In this work, we present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs. Further, unlike previous studies that found pre-training tasks on video…
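The abstract's concern about "imagified" inputs can be made concrete with a small sketch. The example below is an illustration only, not the paper's code: the feature values are made up, and `mean_pool` is a hypothetical helper standing in for the mean-pooling step applied to per-frame 2D CNN features. It shows that mean-pooling gives the same video representation whether the frames are played forward or in reverse, so frame order cannot be recovered from it.

```python
# Illustrative sketch only: toy per-frame feature vectors stand in for the
# output of a 2D CNN applied to a handful of sparsely sampled frames.

def mean_pool(frame_features):
    """Average per-frame feature vectors into one video-level vector."""
    num_frames = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / num_frames for d in range(dim)]

# Hypothetical features for three sampled frames (values are made up).
forward_clip = [[1.0, 0.0], [2.0, 1.0], [3.0, 4.0]]
reversed_clip = list(reversed(forward_clip))

pooled_forward = mean_pool(forward_clip)
pooled_reversed = mean_pool(reversed_clip)

# The two clips differ in frame order but pool to the same vector.
print(forward_clip != reversed_clip)      # True
print(pooled_forward == pooled_reversed)  # True
```

A model that attends over the ordered sequence of frame features, as a video transformer does, can in principle tell these two clips apart, which is the motivation the abstract gives for explicit temporal modeling.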
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dense Connections · Softmax · Residual Connection · Adam
