ContentV: Efficient Training of Video Generation Models with Limited Compute

Wenfeng Lin; Renjie Chen; Boyuan Liu; Shiyue Yan; Ruoyu Feng; Jiangchuan Wei; Yichen Zhang; Yimeng Zhou; Chao Feng; Jiao Ran; Qi Wu; Zuotao Liu; Mingyu Guo

arXiv:2506.05343·cs.CV·June 12, 2025

TL;DR

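ContentV is an 8B-parameter text-to-video model that reports state-of-the-art VBench results (85.14) after four weeks of training on 256 NPUs. It relies on a minimalist architecture that reuses pre-trained image generation models, a multi-stage training strategy based on flow matching, and reinforcement learning with human feedback that requires no additional human annotations.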

Contribution

The paper introduces ContentV, a novel 8B-parameter text-to-video model that combines a minimalist architecture, multi-stage training, and reinforcement learning to enable high-quality video generation with limited compute.

Findings

01

Achieves 85.14 on VBench after four weeks of training on 256 NPUs with 64GB of memory each

02

Generates diverse, high-quality videos across multiple resolutions and durations

03

Reduces computational costs significantly compared to previous models
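
To give a rough sense of scale for the training budget quoted in the first finding (256 NPUs for 4 weeks), the figures can be converted to device-hours. This is only a back-of-the-envelope sketch: it assumes every device runs continuously for the full four weeks, which the page does not state, and the constant and function names are illustrative.

```python
# Figures taken from the summary above.
NUM_DEVICES = 256      # NPUs used for training
TRAINING_WEEKS = 4     # reported training duration

HOURS_PER_WEEK = 7 * 24


def device_hours(num_devices: int, weeks: int) -> int:
    """Upper-bound device-hours, ASSUMING 100% utilization for the whole period."""
    return num_devices * weeks * HOURS_PER_WEEK


total_device_hours = device_hours(NUM_DEVICES, TRAINING_WEEKS)
print(total_device_hours)  # 172032
```

The result is an upper bound on wall-clock device time, not a measure of FLOPs or cost, since the page gives no per-device throughput or utilization figures.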

Abstract

Recent advances in video generation demand increasingly efficient training recipes to mitigate escalating computational costs. In this report, we present ContentV, an 8B-parameter text-to-video model that achieves state-of-the-art performance (85.14 on VBench) after training on 256 x 64GB Neural Processing Units (NPUs) for merely four weeks. ContentV generates diverse, high-quality videos across multiple resolutions and durations from text prompts, enabled by three key innovations: (1) A minimalist architecture that maximizes reuse of pre-trained image generation models for video generation; (2) A systematic multi-stage training strategy leveraging flow matching for enhanced efficiency; and (3) A cost-effective reinforcement learning with human feedback framework that improves generation quality without requiring additional human annotations. All the code and models are available at:…
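
The abstract names flow matching as the training objective, but this page does not give the paper's exact formulation. The sketch below shows one common linear-interpolation variant as an illustration only: a noise sample `x0` and a data sample `x1` are blended at time `t`, and the model is trained to predict the constant velocity `x1 - x0`. The convention (noise at `t = 0`, data at `t = 1`), the plain-list representation, and the `predict_velocity` callable standing in for the video model are all assumptions, not details from the paper.

```python
from typing import Callable, List

Vector = List[float]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    """Velocity of the straight path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(
    predict_velocity: Callable[[Vector, float], Vector],  # hypothetical model
    x0: Vector,
    x1: Vector,
    t: float,
) -> float:
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


# Toy usage: a "model" that always predicts zero velocity.
noise = [0.0, 1.0]
data = [2.0, 3.0]
loss = flow_matching_loss(lambda xt, t: [0.0] * len(xt), noise, data, t=0.5)
print(loss)  # 4.0
```

In a real training loop `t` would be sampled per example and the loss averaged over a batch; the toy predictor here exists only to make the arithmetic easy to check.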

Code & Models

Models

ByteDance/ContentV-8B (Hugging Face model; 30 downloads, 55 likes)

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning