TV2TV: A Unified Framework for Interleaved Language and Video Generation

Xiaochuang Han; Youssef Emad; Melissa Hall; John Nguyen; Karthik Padthe; Liam Robbins; Amir Bar; Delong Chen; Michal Drozdzal; Maha Elbayad; Yushi Hu; Shang-Wen Li; Sreya Dutta Roy; Jakob Verbeek; XuDong Wang; Marjan Ghazvininejad; Luke Zettlemoyer; Emily Dinan

arXiv:2512.05103·cs.LG·December 15, 2025

TL;DR

TV2TV is a unified framework that interleaves text and video generation. By alternating between language and pixel generation, it aims to improve visual quality, controllability, and reasoning in complex video outputs.

Contribution

The paper presents a Mixture-of-Transformers (MoT) architecture for interleaved video-text generation, targeting better reasoning, finer control, and scalability to real-world videos.

Findings

01. Improved visual quality and prompt alignment in generated videos.

02. Enhanced controllability through text interventions during generation.

03. Successful scaling to natural videos with complex action sequences.

Abstract

Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before "acting in pixels" to produce frames. This design…
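The inference-time alternation described in the abstract can be pictured as a small control loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the control token `BEGIN_FRAMES`, the helpers `sample_text_token` and `sample_frames`, and the parameters `frames_per_chunk` and `max_text_run` are all hypothetical, and both samplers are stubs standing in for the real next-token and flow-matching steps.

```python
import random
from typing import List, Tuple

# Hypothetical control token that hands over from text to video (not from the paper).
BEGIN_FRAMES = "<frames>"

# A generated segment is ("text", token) or ("video", frame_id).
Segment = Tuple[str, str]


def sample_text_token(history: List[Segment], rng: random.Random) -> str:
    """Stub for next-token prediction; a real model would run its LM head here."""
    if rng.random() < 0.25:
        return BEGIN_FRAMES  # the model itself decides when to switch modality
    return f"tok{len(history)}"


def sample_frames(history: List[Segment], n: int) -> List[str]:
    """Stub for next-frame prediction; a real model would run a flow-matching sampler."""
    start = sum(1 for kind, _ in history if kind == "video")
    return [f"frame{start + i}" for i in range(n)]


def generate(max_frames: int, frames_per_chunk: int = 2, seed: int = 0,
             max_text_run: int = 8) -> List[Segment]:
    """Alternate between text tokens and frame chunks until max_frames is reached."""
    rng = random.Random(seed)
    history: List[Segment] = []
    n_frames = 0
    text_run = 0
    while n_frames < max_frames:
        token = sample_text_token(history, rng)
        if token != BEGIN_FRAMES and text_run < max_text_run:
            # "Think in words": keep extending the text plan.
            history.append(("text", token))
            text_run += 1
            continue
        # "Act in pixels": emit a chunk of frames conditioned on the history so far.
        # If the text cap forced the switch, the sampled token is discarded.
        k = min(frames_per_chunk, max_frames - n_frames)
        for frame in sample_frames(history, k):
            history.append(("video", frame))
        n_frames += k
        text_run = 0
    return history


if __name__ == "__main__":
    for kind, value in generate(max_frames=4):
        print(kind, value)
```

In this toy version, a text intervention would amount to appending user-written `("text", ...)` segments to `history` between frame chunks; how the actual model consumes such edits is not specified in the text above.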

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation