Everything is a Video: Unifying Modalities through Next-Frame Prediction
G. Thomas Hudson, Dean Slack, Thomas Winterbottom, Jamie Sterling, Chenghao Xiao, Junjie Shentu, Noura Al Moubayed

TL;DR
This paper introduces a unified framework that reformulates diverse multimodal tasks as next-frame prediction problems, enabling a single model to handle multiple modalities seamlessly and improve generalization across tasks.
Contribution
The paper presents a novel task reformulation approach that unifies multimodal learning into a single next-frame prediction framework, reducing the need for modality-specific components.
Findings
Generalizes across text, image, audio, and video modalities.
Achieves competitive performance on various multimodal tasks.
Simplifies multimodal model design and enhances transferability.
Abstract
Multimodal learning, which involves integrating information from various modalities such as text, images, audio, and video, is pivotal for numerous complex tasks like visual question answering, cross-modal retrieval, and caption generation. Traditional approaches rely on modality-specific encoders and late fusion techniques, which can hinder scalability and flexibility when adapting to new tasks or modalities. To address these limitations, we introduce a novel framework that extends the concept of task reformulation beyond natural language processing (NLP) to multimodal learning. We propose to reformulate diverse multimodal tasks into a unified next-frame prediction problem, allowing a single model to handle different modalities without modality-specific components. This method treats all inputs and outputs as sequential frames in a video, enabling seamless integration of modalities and…
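To make the reformulation idea concrete, here is a minimal sketch of how inputs and outputs from different modalities might be laid out as one frame sequence and split into next-frame prediction examples. It is an illustration under stated assumptions, not the authors' implementation: the frame size, the byte-packing of text, the separator frame, and all function names (`text_to_frame`, `image_to_frame`, `make_video`, `next_frame_pairs`) are hypothetical and do not come from the paper.

```python
import numpy as np

FRAME_SHAPE = (8, 8)  # hypothetical frame size, chosen only for illustration


def text_to_frame(text):
    """Pack the UTF-8 bytes of `text` into one zero-padded frame.

    This is a toy stand-in for rendering text into a frame; the excerpt
    does not say how the paper encodes text.
    """
    data = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)
    size = FRAME_SHAPE[0] * FRAME_SHAPE[1]
    if data.size > size:
        raise ValueError("text does not fit in one frame")
    frame = np.zeros(size, dtype=np.uint8)
    frame[: data.size] = data
    return frame.reshape(FRAME_SHAPE)


def image_to_frame(image):
    """Use a grayscale image of the right size directly as a frame."""
    image = np.asarray(image, dtype=np.uint8)
    if image.shape != FRAME_SHAPE:
        raise ValueError("image must match FRAME_SHAPE")
    return image


def make_video(input_frames, output_frames):
    """Concatenate inputs, an assumed all-255 separator frame, and outputs."""
    separator = np.full(FRAME_SHAPE, 255, dtype=np.uint8)
    return np.stack([*input_frames, separator, *output_frames])


def next_frame_pairs(video):
    """Return (context, target) pairs: all frames before t, and frame t."""
    return [(video[:t], video[t]) for t in range(1, len(video))]


# A toy visual-question-answering example: image + question -> answer.
image = np.zeros(FRAME_SHAPE, dtype=np.uint8)
video = make_video(
    [image_to_frame(image), text_to_frame("what colour?")],
    [text_to_frame("black")],
)
pairs = next_frame_pairs(video)
```

In this layout the last pair carries the task itself: its context holds the image, the question, and the separator, and its target is the answer frame. A single sequence model trained on such pairs would need no modality-specific encoder, which is the property the abstract emphasizes.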
Taxonomy
Topics: Video Analysis and Summarization · Music and Audio Processing
