MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, Yejin Choi

TL;DR
MERLOT Reserve is a multimodal video understanding model pretrained on 20 million YouTube videos. Its masking objective learns jointly from audio, subtitles, and video frames, and the finetuned model sets state-of-the-art results on VCR, TVQA, and Kinetics-600.
Contribution
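Introduces MERLOT Reserve, a multimodal pretraining method that learns jointly from audio, subtitles, and video frames. Snippets of text and audio are replaced with a MASK token, and the model learns by choosing the correct masked-out snippet. The objective learns faster than alternatives and scales to pretraining on 20 million YouTube videos.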
Findings
When finetuned, sets a new state-of-the-art on VCR, TVQA, and Kinetics-600, outperforming prior work by 5%, 7%, and 1.5% respectively.
Audio pretraining enhances performance even on image-centric tasks.
Achieves competitive zero-shot results on multiple video understanding benchmarks.
Abstract
As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time -- through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong multimodal representations. When finetuned, it sets state-of-the-art on Visual Commonsense Reasoning (VCR), TVQA, and Kinetics-600; outperforming prior work by 5%, 7%, and 1.5% respectively. Ablations show that these tasks benefit from audio pretraining -- even VCR, a QA task centered around images (without sound).…
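The abstract describes the objective only at a high level: a snippet is masked, and the model must choose the correct masked-out snippet. A minimal sketch of one way such a "choose the right snippet" objective can be scored is shown below. It assumes the model emits one vector per MASK position and that each candidate snippet has its own encoding. The cosine-similarity scoring, the temperature value, and the names `masked_snippet_loss` and `choose_snippet` are illustrative assumptions, not details taken from the paper.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def masked_snippet_loss(predicted, candidates, temperature=0.05):
    """Cross-entropy over candidate snippets (illustrative assumption).

    predicted[i]  : model output at the i-th MASK position
    candidates[i] : encoding of the snippet that was actually masked there
    Every other candidate in the batch serves as a negative.
    """
    preds = [normalize(p) for p in predicted]
    cands = [normalize(c) for c in candidates]
    total = 0.0
    for i, p in enumerate(preds):
        logits = [dot(p, c) / temperature for c in cands]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # negative log-probability of the true snippet
    return total / len(preds)


def choose_snippet(pred, candidates):
    """Return the index of the candidate most similar to the MASK output."""
    p = normalize(pred)
    scores = [dot(p, normalize(c)) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data: two masked positions and two candidate snippets (hypothetical values).
candidates = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]    # predictions close to the right snippets
shuffled = [[0.1, 0.9], [0.9, 0.1]]   # predictions close to the wrong snippets

print(masked_snippet_loss(aligned, candidates))
print(masked_snippet_loss(shuffled, candidates))
```

In this toy setup the loss is lower when each MASK output lies closest to its own snippet, which is the behaviour a selection-style objective rewards.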
