TokensGen: Harnessing Condensed Tokens for Long Video Generation

Wenqi Ouyang; Zeqi Xiao; Danni Yang; Yifan Zhou; Shuai Yang; Lei Yang; Jianlou Si; Xingang Pan

arXiv:2507.15728·cs.CV·July 22, 2025

TL;DR

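TokensGen is a two-stage framework for long video generation. A token diffusion transformer (T2To) generates condensed video tokens for all clips at once, and a token-guided short-video diffusion model (To2V) renders each clip from them. The design targets the memory bottlenecks and long-term inconsistency that arise when short-clip models are extended to longer durations, and aims for better temporal coherence and smoother transitions between clips.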

Contribution

The paper combines a Video Tokenizer, which condenses short clips into semantically rich tokens, with a video token diffusion transformer that generates the tokens for every clip jointly. This extends short-clip diffusion models to scalable, coherent long video generation.

Findings

01. Enhanced long-term temporal coherence

02. Improved inter-clip smoothness

03. Reduced boundary artifacts

Abstract

Generating consistent long videos is a complex challenge: while diffusion-based generative models generate visually impressive short clips, extending them to longer durations often leads to memory bottlenecks and long-term inconsistency. In this paper, we propose TokensGen, a novel two-stage framework that leverages condensed tokens to address these issues. Our method decomposes long video generation into three core tasks: (1) inner-clip semantic control, (2) long-term consistency control, and (3) inter-clip smooth transition. First, we train To2V (Token-to-Video), a short video diffusion model guided by text and video tokens, with a Video Tokenizer that condenses short clips into semantically rich tokens. Second, we introduce T2To (Text-to-Token), a video token diffusion transformer that generates all tokens at once, ensuring global consistency across clips. Finally, during inference,…
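The division of labor between T2To and To2V described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the functions `t2to_stub`, `to2v_stub`, and `generate_long_video`, the token shapes, and the string "frames" are all hypothetical placeholders standing in for trained diffusion models. The sketch only shows the control flow (tokens for all clips produced in one call, then each clip rendered from its own tokens) and leaves out the inference-time step that the truncated abstract does not spell out.

```python
import random
from typing import List

Token = List[float]


def t2to_stub(prompt: str, num_clips: int, tokens_per_clip: int = 4,
              dim: int = 8, seed: int = 0) -> List[List[Token]]:
    """Hypothetical stand-in for T2To: returns tokens for ALL clips in one call.

    A real model would run a diffusion transformer over the whole token
    sequence conditioned on the prompt; here we just draw random numbers.
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(tokens_per_clip)]
        for _ in range(num_clips)
    ]


def to2v_stub(prompt: str, clip_tokens: List[Token],
              frames_per_clip: int = 16) -> List[str]:
    """Hypothetical stand-in for To2V: one short clip from text plus clip tokens.

    Frames are represented as strings so the example stays dependency-free.
    """
    cond = sum(sum(tok) for tok in clip_tokens)  # toy summary of the conditioning
    return [f"frame(cond={cond:.3f}, t={t})" for t in range(frames_per_clip)]


def generate_long_video(prompt: str, num_clips: int,
                        frames_per_clip: int = 16) -> List[str]:
    """Generate tokens for every clip first, then render clips one by one."""
    all_tokens = t2to_stub(prompt, num_clips)  # global plan across clips
    video: List[str] = []
    for clip_tokens in all_tokens:  # each clip is conditioned on its own tokens
        video.extend(to2v_stub(prompt, clip_tokens, frames_per_clip))
    return video


if __name__ == "__main__":
    frames = generate_long_video("a boat drifting down a river", num_clips=3)
    print(len(frames))
```

Because every clip's tokens come from a single T2To call, the per-clip generator never has to hold the full video in memory; only the condensed tokens span the whole sequence. In this sketch the clips are simply concatenated, so it says nothing about how transitions between adjacent clips are smoothed.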
