Motif-Video 2B: Technical Report

Junghwan Lim; Wai Ting Cheung; Minsu Ha; Beomgyu Kim; Taewhan Kim; Haesol Lee; Dongpin Oh; Jeesoo Lee; Taehyun Kim; Minjae Kim; Sungmin Lee; Hyeyeon Cho; Dahye Choi; Jaeheui Her; Jaeyeon Huh; Hanbin Jung; Changjin Kang; Dongseok Kim; Jangwoong Kim; Youngrok Kim; Hyukjin Kweon; Hongjoo Lee; Jeongdoo Lee; Junhyeok Lee; Eunhwan Park; Yeongjae Park; Bokki Ryu; Dongjoo Weon

arXiv:2604.16503 · cs.CV · May 20, 2026

TL;DR

Motif-Video 2B demonstrates that architectural specialization and efficient training enable high-quality text-to-video generation with significantly less data and compute than larger models.

Contribution

The paper introduces a novel architecture that separates roles in video generation, paired with an efficient training method, achieving competitive results with fewer resources.

Findings

01. Achieves 83.76% on VBench, surpassing larger models
02. Uses 7× fewer parameters and less training data
03. Develops clearer cross-frame attention structures

Abstract

Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. In this work, we ask whether strong text-to-video quality is possible at a much smaller budget: fewer than 10M clips and less than 100,000 H200 GPU hours. Our core claim is that part of the answer lies in how model capacity is organized, not only in how much of it is used. In video generation, prompt alignment, temporal consistency, and fine-detail recovery can interfere with one another when they are handled through the same pathway. Motif-Video 2B addresses this by separating these roles architecturally, rather than relying on scale alone. The model combines two key ideas. First, Shared Cross-Attention strengthens text control when video token sequences become long. Second, a three-part backbone separates early fusion, joint representation learning, and detail…
