FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space

FSVideo Team; Qingyu Chen; Zhiyuan Fang; Haibin Huang; Xinwei Huang; Tong Jin; Minxuan Lin; Bo Liu; Celong Liu; Chongyang Ma; Xing Mei; Xiaohui Shen; Yaojie Shen; Fuwen Tan; Angtian Wang; Xiao Yang; Yiding Yang; Jiamin Yuan; Lingxi Zhang; Yuxin Zhang

arXiv:2602.02092·cs.CV·February 3, 2026

TL;DR

FSVideo is a fast transformer-based image-to-video (I2V) diffusion framework. It combines a highly compressed latent space, a diffusion transformer with a layer memory design, and multi-resolution generation via a few-step upsampler, reaching quality competitive with popular open-source models while being an order of magnitude faster.

Contribution

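The paper presents a new video autoencoder with a highly compressed latent space ($64 \times 64 \times 4$ spatial-temporal downsampling), a diffusion transformer whose layer memory design improves inter-layer information flow and context reuse, and a multi-resolution generation strategy built on a few-step upsampler. Together these enable faster video diffusion.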

Findings

01

Achieves competitive video quality while running an order of magnitude faster than popular open-source models.

02

Uses a highly compressed latent space for efficient processing.

03

Achieves competitive performance against popular open-source models with a 14B DIT base model and a 14B DIT upsampler.

Abstract

We introduce FSVideo, a fast transformer-based image-to-video (I2V) diffusion framework. We build our framework on the following key components: (1) a new video autoencoder with a highly compressed latent space ($64 \times 64 \times 4$ spatial-temporal downsampling ratio) that achieves competitive reconstruction quality; (2) a diffusion transformer (DIT) architecture with a new layer memory design to enhance inter-layer information flow and context reuse within the DIT; and (3) a multi-resolution generation strategy via a few-step DIT upsampler to increase video fidelity. Our final model, which contains a 14B DIT base model and a 14B DIT upsampler, achieves competitive performance against other popular open-source models while being an order of magnitude faster. We discuss our model design as well as training strategies in this report.
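To give a sense of what a $64 \times 64 \times 4$ downsampling ratio means for the size of the latent grid, the following is a minimal sketch, not the authors' code. The function names, the example clip size, the use of ceiling division, and the $8 \times 8 \times 4$ comparison ratio are all assumptions chosen for illustration; the sketch also ignores any frame-count convention or patchification step the actual autoencoder and DIT may use.

```python
import math


def latent_grid(frames, height, width, spatial_ratio=64, temporal_ratio=4):
    """Return (latent_frames, latent_height, latent_width).

    Assumes plain ceiling division along each axis; the real autoencoder
    may treat boundaries or the first frame differently.
    """
    return (
        math.ceil(frames / temporal_ratio),
        math.ceil(height / spatial_ratio),
        math.ceil(width / spatial_ratio),
    )


def latent_positions(frames, height, width, spatial_ratio=64, temporal_ratio=4):
    """Number of spatio-temporal positions in the latent grid."""
    t, h, w = latent_grid(frames, height, width, spatial_ratio, temporal_ratio)
    return t * h * w


# Hypothetical clip: 120 frames at 1280x768 (width x height).
frames, height, width = 120, 768, 1280

# Ratio stated in the abstract: 64 x 64 spatial, 4 temporal.
compressed = latent_positions(frames, height, width, spatial_ratio=64)

# Illustrative comparison ratio (an assumption, not taken from the paper).
baseline = latent_positions(frames, height, width, spatial_ratio=8)

print(latent_grid(frames, height, width))  # (30, 12, 20)
print(compressed)                          # 7200
print(baseline // compressed)              # 64
```

Under these assumptions the latent grid has 64 times fewer positions than it would with an $8 \times 8$ spatial ratio. Because self-attention cost grows faster than linearly with sequence length, a reduction of this kind is one plausible source of the speedup the abstract reports, though the report itself should be consulted for the actual breakdown.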
