SemanticGen: Video Generation in Semantic Space

Jianhong Bai; Xiaoshi Wu; Xintao Wang; Xiao Fu; Yuanxing Zhang; Qinghe Wang; Xiaoyu Shi; Menghan Xia; Zuozhu Liu; Haoji Hu; Pengfei Wan; Kun Gai

arXiv:2512.20619 · cs.CV · December 29, 2025

Open Access

TL;DR

SemanticGen is a two-stage video generation method that first plans a video in a compact, high-level semantic space and then adds high-frequency detail, aiming for faster convergence and cheaper long-video generation than modeling VAE latents directly.

Contribution

It proposes a video generation framework that operates in semantic space, and reports better efficiency and quality than methods that model the distribution of VAE latents directly.

Findings

01. Faster convergence compared to VAE latent space methods

02. Effective long video generation with high quality

03. Outperforms state-of-the-art approaches

Abstract

State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, a novel solution to address these limitations by generating videos in the semantic space. Our main insight is that, due to the inherent redundancy in videos, the generation process should begin in a compact, high-level semantic space for global planning, followed by the addition of high-frequency details, rather than directly modeling a vast set of low-level video tokens using bi-directional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the…
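The efficiency argument in the abstract can be made concrete with a rough count. Full bi-directional attention compares every token with every other token, so its cost grows with the square of the token count. The sketch below is only an illustration of that scaling: the video size, the downsampling strides, and the helper names `token_count` and `attention_pairs` are all assumptions made for this example, not values or code from the paper.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs in full bi-directional self-attention."""
    return num_tokens * num_tokens


def token_count(frames: int, height: int, width: int,
                t_stride: int, s_stride: int) -> int:
    """Tokens left after downsampling by t_stride in time and s_stride in space."""
    return (frames // t_stride) * (height // s_stride) * (width // s_stride)


# Hypothetical video size and strides, chosen only for illustration
# (none of these numbers come from the paper).
FRAMES, HEIGHT, WIDTH = 240, 480, 832

# Assumed token grid for a VAE-latent model.
latent_tokens = token_count(FRAMES, HEIGHT, WIDTH, t_stride=4, s_stride=16)

# Assumed, more compact token grid for a semantic representation.
semantic_tokens = token_count(FRAMES, HEIGHT, WIDTH, t_stride=8, s_stride=32)

ratio = attention_pairs(latent_tokens) / attention_pairs(semantic_tokens)

print(f"latent tokens:        {latent_tokens}")
print(f"semantic tokens:      {semantic_tokens}")
print(f"attention pair ratio: {ratio:.0f}x")
```

Under these assumed strides the semantic grid has 8 times fewer tokens, so full attention over it touches 64 times fewer token pairs. The actual compression ratio of SemanticGen is not stated in the text above.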

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Motion and Animation