TL;DR
LongCat-Video is a 13.6B-parameter unified Diffusion Transformer (DiT) model for efficient, high-quality long video generation across multiple tasks, presented as a first step toward world models.
Contribution
The paper introduces a versatile, large-scale video generation model that supports multiple tasks with a single architecture, offers efficient inference and strong performance, and is trained with multi-reward RLHF.
Findings
Supports text-to-video, image-to-video, and video continuation tasks
Generates 720p, 30fps videos within minutes
Achieves performance comparable to leading models
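To make the "one model, three tasks" finding concrete, here is a minimal sketch of how a single entry point could tell the tasks apart by the number of conditioning frames supplied. This dispatch rule, and the names `GenerationRequest`, `infer_task`, and `frames_needed`, are hypothetical illustrations and not LongCat-Video's actual interface. The frame-count helper only shows the arithmetic behind a minutes-long clip at 30fps.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class GenerationRequest:
    """Hypothetical request: a text prompt plus optional conditioning frames."""
    prompt: str
    condition_frames: Sequence = ()


def infer_task(request: GenerationRequest) -> str:
    """Assumed rule: the task is implied by how many frames are given."""
    n = len(request.condition_frames)
    if n == 0:
        return "text-to-video"
    if n == 1:
        return "image-to-video"
    return "video-continuation"


def frames_needed(duration_s: float, fps: int = 30) -> int:
    """Number of frames in a clip of the given duration at the given frame rate."""
    return round(duration_s * fps)


if __name__ == "__main__":
    print(infer_task(GenerationRequest("a cat on a skateboard")))
    print(frames_needed(60))  # one minute at 30fps
```

Under this assumed rule, a one-minute 30fps clip corresponds to 1800 frames, which gives a sense of the sequence lengths that long video generation has to handle.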
Abstract
Video generation is a critical pathway toward world models, with efficient long video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across multiple video generation tasks. It particularly excels in efficient and high-quality long video generation, representing our first step toward world models. Key features include:
Unified architecture for multiple tasks: Built on the Diffusion Transformer (DiT) framework, LongCat-Video supports Text-to-Video, Image-to-Video, and Video-Continuation tasks with a single model.
Long video generation: Pretraining on Video-Continuation tasks enables LongCat-Video to maintain high quality and temporal coherence in the generation of minutes-long videos.
Efficient inference: LongCat-Video generates 720p, 30fps videos within minutes by…
