LoViC: Efficient Long Video Generation with Context Compression

Jiaxiu Jiang; Wenbo Li; Jingjing Ren; Yuping Qiu; Yong Guo; Xiaogang Xu; Han Wu; Wangmeng Zuo

arXiv:2507.12952 · cs.CV · July 18, 2025

TL;DR

LoViC is a scalable diffusion transformer framework for long video generation. Its context compression module, FlexFormer, enables efficient, coherent, and versatile long video synthesis.

Contribution

The paper presents FlexFormer, a new autoencoder for variable-length video and text compression, and a segment-wise generation process that scales long video synthesis efficiently.

Findings

1. Achieves high-quality long video generation with coherent content.

2. Supports various tasks like prediction, interpolation, and multi-shot generation.

3. Demonstrates effectiveness across diverse open-domain videos.

Abstract

Despite recent advances in diffusion transformers (DiTs) for text-to-video generation, scaling to long-duration content remains challenging due to the quadratic complexity of self-attention. While prior efforts, such as sparse attention and temporally autoregressive models, offer partial relief, they often compromise temporal coherence or scalability. We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos, designed to produce long, coherent videos through a segment-wise generation process. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations. It supports variable-length inputs with linearly adjustable compression rates, enabled by a single query token design based on the Q-Former architecture. Additionally, by encoding temporal context through position-aware…
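To make the quadratic-cost argument concrete, the following is a rough back-of-the-envelope sketch, not the paper's implementation. It counts query-key pairs for full self-attention over one long token sequence and compares that with a segment-wise scheme in which each segment attends to itself plus a compressed summary of all earlier segments. The function names, the token counts, the segment length, and the compression rate are all hypothetical, and the assumption that every earlier segment is retained in compressed form is ours, not something the abstract states.

```python
import math


def full_attention_pairs(num_tokens: int) -> int:
    """Query-key pairs for full self-attention over one sequence: N * N."""
    return num_tokens * num_tokens


def segmentwise_attention_pairs(
    num_tokens: int, segment_len: int, compression_rate: int
) -> int:
    """Query-key pairs when generating segment by segment.

    Assumption (illustrative only): each segment attends to its own tokens
    plus a compressed context holding ceil(len / compression_rate) tokens
    for every earlier segment, i.e. a linear compression rate.
    """
    total_pairs = 0
    context_tokens = 0
    remaining = num_tokens
    while remaining > 0:
        seg = min(segment_len, remaining)
        # Queries come from the current segment; keys are segment + context.
        total_pairs += seg * (seg + context_tokens)
        # Compress the finished segment and append it to the context.
        context_tokens += math.ceil(seg / compression_rate)
        remaining -= seg
    return total_pairs


if __name__ == "__main__":
    n = 8000  # hypothetical number of latent tokens in a long video
    full = full_attention_pairs(n)
    seg = segmentwise_attention_pairs(n, segment_len=1000, compression_rate=10)
    print(f"full attention:  {full:,} pairs")
    print(f"segment-wise:    {seg:,} pairs")
    print(f"reduction:       {full / seg:.1f}x")
```

Under these made-up numbers the segment-wise scheme needs roughly one sixth of the query-key pairs. The saving grows with video length and compression rate, although the compressed context in this sketch still grows linearly with the number of segments.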

Taxonomy

Topics: Video Analysis and Summarization · Video Coding and Compression Technologies · Advanced Vision and Imaging