TL;DR
SANA-WM is an efficient, open-source world model capable of generating high-quality, minute-scale videos with precise camera control, outperforming prior models in efficiency and action-following accuracy.
Contribution
The paper introduces SANA-WM, a 2.6B-parameter hybrid linear diffusion transformer that improves both the efficiency and the action-following accuracy of minute-scale world modeling.
Findings
SANA-WM achieves visual quality comparable to large industrial baselines such as LingBot-World and HY-WorldPlay.
It trains in 15 days on 64 H100 GPUs using only 213K videos.
It demonstrates 36x higher throughput than prior open-source models.
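For a rough sense of scale, the training budget quoted above can be converted into GPU-hours. This is plain arithmetic on the two stated numbers (64 GPUs, 15 days) and assumes all GPUs ran for the full period, which the summary does not confirm.

```python
# Back-of-the-envelope training budget from the figures stated above.
# Assumption: all 64 GPUs are busy for the whole 15 days.
num_gpus = 64
training_days = 15
hours_per_day = 24

gpu_hours = num_gpus * training_days * hours_per_day
print(gpu_hours)  # 23040
```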
Abstract
We introduce SANA-WM, an efficient 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing high-fidelity, 720p, minute-scale videos with precise camera control. SANA-WM achieves visual quality comparable to large-scale industrial baselines such as LingBot-World and HY-WorldPlay, while significantly improving efficiency. Four core designs drive our architecture: (1) Hybrid Linear Attention combines frame-wise Gated DeltaNet (GDN) with softmax attention for memory-efficient long-context modeling. (2) Dual-Branch Camera Control ensures precise 6-DoF trajectory adherence. (3) Two-Stage Generation Pipeline applies a long-video refiner to stage-1 outputs, improving quality and consistency across sequences. (4) Robust Annotation Pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to yield high-quality, spatiotemporally…
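To illustrate why a Gated DeltaNet layer is memory-efficient over long contexts, here is a minimal sketch of the generic gated delta rule as a per-token recurrence. It is not SANA-WM's implementation: the frame-wise variant and its combination with softmax attention are not described in enough detail here, and the function name `gated_delta_step` and the scalar gates `alpha` and `beta` are illustrative assumptions. The point is that the state `S` has a fixed size (d_v x d_k) no matter how many tokens have been processed.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One step of a generic gated delta rule (illustrative sketch only).

    S     : d_v x d_k state matrix, as a list of rows
    q, k  : query and key vectors of length d_k
    v     : value vector of length d_v
    alpha : decay gate in [0, 1] applied to the old state
    beta  : write strength in [0, 1]
    """
    d_v, d_k = len(S), len(k)
    # Value currently stored under key k: pred = S @ k
    pred = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # S_new = alpha * (S - beta * pred k^T) + beta * v k^T
    S_new = [
        [alpha * (S[i][j] - beta * pred[i] * k[j]) + beta * v[i] * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]
    # Read out with the query: out = S_new @ q
    out = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, out


# Writing twice under the same unit-norm key replaces the old value.
S = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
S, out1 = gated_delta_step(S, key, key, [3.0, 4.0], alpha=1.0, beta=1.0)
S, out2 = gated_delta_step(S, key, key, [5.0, 6.0], alpha=1.0, beta=1.0)
print(out1)  # [3.0, 4.0]
print(out2)  # [5.0, 6.0]
```

Unlike a softmax-attention key/value cache, which grows with every token, this state is overwritten in place, so memory stays constant as the video gets longer.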
