MERLOT: Multimodal Neural Script Knowledge Models
Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi

TL;DR
MERLOT is a self-supervised multimodal model trained on YouTube videos that learns temporal and contextual understanding, achieving state-of-the-art results on video and image reasoning tasks.
Contribution
Introduces MERLOT, a novel self-supervised multimodal model that learns script knowledge from videos for improved reasoning across visual and temporal contexts.
Findings
Achieves 80.6% accuracy on Visual Commonsense Reasoning.
Outperforms similarly sized models without relying on heavy supervision.
Demonstrates the importance of diverse training objectives.
Abstract
As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech -- in an entirely label-free, self-supervised manner. By pretraining with a mix of both frame-level (spatial) and video-level (temporal) objectives, our model not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time. As a result, MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets when finetuned. It also transfers well to the world of static images, allowing models to reason about the dynamic context behind visual scenes.…
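The frame-level objective described above, matching images to temporally corresponding words, can be illustrated with a symmetric contrastive loss. What follows is a minimal sketch under stated assumptions, not the authors' implementation: the function name `frame_text_contrastive_loss`, the `temperature` value, and the use of plain Python lists as embeddings are all hypothetical choices made for illustration. Row i of the frame embeddings is assumed to be temporally aligned with row i of the transcript embeddings.

```python
import math

def _normalize(vec):
    # Scale a vector to unit length (assumes the vector is nonzero).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def _log_softmax(scores):
    # Numerically stable log-softmax over a list of scores.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def frame_text_contrastive_loss(frame_emb, text_emb, temperature=0.05):
    """Symmetric contrastive loss over paired frame/text embeddings.

    Hypothetical helper: frame_emb[i] and text_emb[i] are assumed to come
    from the same moment in a video; all other pairings act as negatives.
    """
    frames = [_normalize(v) for v in frame_emb]
    texts = [_normalize(v) for v in text_emb]
    n = len(frames)

    # Cosine similarity between every frame and every text segment.
    logits = [
        [sum(a * b for a, b in zip(f, t)) / temperature for t in texts]
        for f in frames
    ]

    # Frame-to-text direction: each frame should pick out its own text.
    f2t = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-frame direction: each text should pick out its own frame.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2f = -sum(_log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (f2t + t2f)
```

In this sketch the loss is small when each frame is most similar to its own transcript segment and grows when the pairing is shuffled. The video-level (temporal) objectives mentioned in the abstract are not covered here.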
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
