MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

Rowan Zellers; Jiasen Lu; Ximing Lu; Youngjae Yu; Yanpeng Zhao; Mohammadreza Salehi; Aditya Kusupati; Jack Hessel; Ali Farhadi; Yejin Choi

arXiv:2201.02639·cs.CV·May 16, 2022

TL;DR

MERLOT Reserve is a multimodal video understanding model trained on 20 million YouTube videos, achieving state-of-the-art results in various vision and language tasks through a novel masking training objective that leverages audio, subtitles, and video frames.

Contribution

Introduces MERLOT Reserve, a new multimodal training method that jointly learns from audio, subtitles, and video frames using a masking objective, scaling to large datasets for improved performance.

Findings

01

Sets new state-of-the-art on VCR, TVQA, and Kinetics-600.

02

Audio pretraining enhances performance even on image-centric tasks.

03

Achieves competitive zero-shot results on multiple video understanding benchmarks.

Abstract

As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time -- through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong multimodal representations. When finetuned, it sets a new state of the art on Visual Commonsense Reasoning (VCR), TVQA, and Kinetics-600, outperforming prior work by 5%, 7%, and 1.5% respectively. Ablations show that these tasks benefit from audio pretraining -- even VCR, a QA task centered around images (without sound).…
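The abstract says only that the model learns by choosing the correct masked-out snippet. One common way to set up such a choice is a contrastive softmax over candidate snippets, sketched below. This is an illustrative assumption, not the authors' implementation: the function name `masked_snippet_loss`, the cosine-similarity scoring, the temperature value, and the toy 3-dimensional vectors are all hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def masked_snippet_loss(mask_pred, candidates, target_index, temperature=0.05):
    """Score each candidate snippet against the model's prediction at the MASK
    position and return (cross-entropy loss, index of the best-scoring candidate).

    Assumed setup (not taken from the paper): mask_pred is the encoder output at
    the MASK token, candidates are embeddings of text or audio snippets, and
    exactly one candidate is the snippet that was masked out.
    """
    q = normalize(mask_pred)
    logits = [dot(q, normalize(c)) / temperature for c in candidates]
    # Numerically stable log-sum-exp for the softmax normalizer.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    loss = log_z - logits[target_index]
    choice = max(range(len(logits)), key=logits.__getitem__)
    return loss, choice

# Toy example: the prediction points mostly along the second candidate.
mask_pred = [0.9, 0.1, 0.0]
candidates = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
loss, choice = masked_snippet_loss(mask_pred, candidates, target_index=1)
wrong_loss, _ = masked_snippet_loss(mask_pred, candidates, target_index=0)
```

In this toy setup the loss is small when the prediction is closest to the true snippet and larger when the target is a different candidate, which is the signal a training loop would minimize.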
