LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity

Hongjie Wang; Chih-Yao Ma; Yen-Cheng Liu; Ji Hou; Tao Xu; Jialiang Wang; Felix Juefei-Xu; Yaqiao Luo; Peizhao Zhang; Tingbo Hou; Peter Vajda; Niraj K. Jha; Xiaoliang Dai

arXiv:2412.09856 · cs.CV · May 27, 2025

TL;DR

LinGen is a text-to-video framework whose cost scales linearly in the number of pixels, enabling high-resolution, minute-length video generation on a single GPU without compromising quality relative to quadratic-complexity Diffusion Transformers.

Contribution

The paper presents LinGen, a linear-complexity architecture that replaces the self-attention block with MATE, a linear-complexity block made up of an MA-branch and a TE-branch, enabling minute-length, high-resolution video generation without compromising quality.

Findings

01. LinGen outperforms DiT with a 75.6% win rate in video quality.

02. LinGen reduces FLOPs and latency by up to 15×.

03. LinGen achieves quality comparable to state-of-the-art models for long videos.

Abstract

Text-to-video generation enhances content creation but is highly computationally intensive: the computational cost of Diffusion Transformers (DiTs) scales quadratically in the number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to generating videos of only 10-20 seconds in length. We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces the computationally dominant, quadratic-complexity block, self-attention, with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement…
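
The scaling argument in the abstract can be illustrated with a small back-of-the-envelope calculation. The sketch below is not taken from the paper: the frame rate, resolution, patch size, and temporal stride are hypothetical tokenization settings chosen only for illustration, and the two cost functions track how cost grows with token count rather than actual FLOPs.

```python
def token_count(seconds, fps=16, height=512, width=512, patch=16, temporal_stride=4):
    """Tokens in a clip under hypothetical (illustrative) tokenization settings."""
    frames = seconds * fps // temporal_stride        # temporally compressed frames
    return frames * (height // patch) * (width // patch)


def quadratic_cost(n):
    # Self-attention relates every token to every other token: cost grows as n^2.
    return n * n


def linear_cost(n):
    # A linear-complexity block touches each token a constant number of times.
    return n


short_clip = token_count(10)   # a 10-second clip
long_clip = token_count(60)    # a minute-length clip

print("token growth:         ", long_clip / short_clip)
print("quadratic cost growth:", quadratic_cost(long_clip) / quadratic_cost(short_clip))
print("linear cost growth:   ", linear_cost(long_clip) / linear_cost(short_clip))
```

Under these assumed settings, a sixfold increase in duration multiplies the token count by six, so a quadratic-cost block becomes 36 times more expensive while a linear-cost block becomes only six times more expensive. The gap widens further as resolution increases, since resolution also multiplies the token count.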

Taxonomy

Topics: Video Analysis and Summarization · Multimedia Communication and Technology · Music and Audio Processing

Methods: Softmax · Attention Is All You Need · Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Diffusion · MATE