MERLOT: Multimodal Neural Script Knowledge Models

Rowan Zellers; Ximing Lu; Jack Hessel; Youngjae Yu; Jae Sung Park; Jize Cao; Ali Farhadi; Yejin Choi

arXiv:2106.02636 · cs.CV · October 25, 2021 · 54 cites

TL;DR

MERLOT is a self-supervised multimodal model trained on YouTube videos that learns temporal and contextual understanding, achieving state-of-the-art results in video and image reasoning tasks.

Contribution

Introduces MERLOT, a novel self-supervised multimodal model that learns script knowledge from videos for improved reasoning across visual and temporal contexts.

Findings

01 Achieves 80.6% accuracy on Visual Commonsense Reasoning.

02 Outperforms similarly sized models, even those trained with heavy auxiliary supervision.

03 Shows that combining frame-level (spatial) and video-level (temporal) pretraining objectives matters for performance.

Abstract

As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech -- in an entirely label-free, self-supervised manner. By pretraining with a mix of both frame-level (spatial) and video-level (temporal) objectives, our model not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time. As a result, MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets when finetuned. It also transfers well to the world of static images, allowing models to reason about the dynamic context behind visual scenes.…
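The abstract's frame-level idea of matching images to temporally corresponding words can be illustrated with a symmetric contrastive loss. The sketch below is not the authors' implementation: the contrastive formulation, the function names, and the temperature value are all assumptions chosen for illustration, and the embeddings are plain Python lists standing in for encoder outputs.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _l2_normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]

def _log_softmax(row):
    # Numerically stable log-softmax over one row of scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def frame_text_contrastive_loss(frame_embs, text_embs, temperature=0.05):
    """Hypothetical symmetric contrastive loss.

    frame_embs[i] and text_embs[i] are assumed to come from the same
    video segment (a frame and the words spoken around it). The
    temperature default is an illustrative assumption, not a value
    taken from the paper.
    """
    frames = [_l2_normalize(f) for f in frame_embs]
    texts = [_l2_normalize(t) for t in text_embs]
    n = len(frames)
    # sim[i][j]: scaled cosine similarity between frame i and text j.
    sim = [[_dot(f, t) / temperature for t in texts] for f in frames]
    # Frame-to-text direction: each frame should pick out its own text.
    f2t = -sum(_log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-frame direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2f = -sum(_log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (f2t + t2f)

# Toy usage: aligned pairs should score a lower loss than shuffled pairs.
frames = [[1.0, 0.0], [0.0, 1.0]]
aligned = frame_text_contrastive_loss(frames, [[1.0, 0.0], [0.0, 1.0]])
shuffled = frame_text_contrastive_loss(frames, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is small when each frame is paired with its own transcript segment and large when the pairing is shuffled, which is the behaviour a frame-to-words matching objective is meant to reward.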

Code & Models

Repositories

rowanz/merlot (TensorFlow)

Videos

MERLOT: Multimodal Neural Script Knowledge Models · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning