SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

Haoyi Zhu; Haozhe Liu; Yuyang Zhao; Tian Ye; Junsong Chen; Jincheng Yu; Tong He; Song Han; Enze Xie

arXiv:2605.15178 · cs.CV · May 15, 2026

TL;DR

SANA-WM is an efficient, open-source world model capable of generating high-quality, minute-scale videos with precise camera control, outperforming prior models in efficiency and action-following accuracy.

Contribution

The paper introduces SANA-WM, a novel hybrid linear diffusion transformer architecture that significantly improves efficiency and accuracy in minute-scale world modeling.

Findings

01. SANA-WM achieves comparable visual quality to large industrial baselines.

02. It trains in 15 days on 64 H100 GPUs using only 213K videos.

03. It demonstrates 36x higher throughput than prior open-source models.
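To put the training budget in the second finding into a single number, the arithmetic below converts it to GPU-hours. This is only a back-of-the-envelope sketch: it assumes all 64 GPUs were busy for the full 15 days and reads "213K" as 213,000 videos, neither of which the page states explicitly.

```python
# Figures quoted in the findings; the interpretation is an assumption.
num_gpus = 64          # H100 GPUs
days = 15              # wall-clock training time
videos = 213_000       # "213K videos", read as 213,000

# Upper bound on compute: every GPU running for the whole period.
gpu_hours = num_gpus * days * 24
gpu_hours_per_video = gpu_hours / videos

print(gpu_hours)                      # 23040
print(round(gpu_hours_per_video, 3))  # 0.108
```

Under these assumptions the run costs at most about 23,000 H100 GPU-hours, or roughly a tenth of a GPU-hour per training video.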

Abstract

We introduce SANA-WM, an efficient 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing high-fidelity, 720p, minute-scale videos with precise camera control. SANA-WM achieves visual quality comparable to large-scale industrial baselines such as LingBot-World and HY-WorldPlay, while significantly improving efficiency. Four core designs drive our architecture: (1) Hybrid Linear Attention combines frame-wise Gated DeltaNet (GDN) with softmax attention for memory-efficient long-context modeling. (2) Dual-Branch Camera Control ensures precise 6-DoF trajectory adherence. (3) Two-Stage Generation Pipeline applies a long-video refiner to stage-1 outputs, improving quality and consistency across sequences. (4) Robust Annotation Pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to yield high-quality, spatiotemporally…
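The memory argument behind design (1) is easier to see in code. The sketch below implements the gated delta rule as it is commonly written for Gated DeltaNet: a fixed-size state matrix is decayed by a gate and then corrected toward the new key/value pair. It is a minimal single-head NumPy illustration, not SANA-WM's implementation. The function names, the per-token (rather than frame-wise) update, and the key normalization are assumptions made for illustration, and the abstract does not say how the GDN layers are interleaved with softmax attention.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One gated delta-rule update on a (d_v, d_k) state matrix S.

    Equivalent to S_t = alpha * S_{t-1} (I - beta k k^T) + beta v k^T.
    """
    S = alpha * S                          # gate: decay the old memory
    pred = S @ k                           # what the memory returns for key k
    S = S + beta * np.outer(v - pred, k)   # delta rule: move that answer toward v
    return S, S @ q                        # new state and the read-out for query q

def gated_delta_scan(Q, K, V, alphas, betas):
    """Run the recurrence over T tokens; the state size never depends on T."""
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d_k))
    outs = np.empty((T, d_v))
    for t in range(T):
        k = K[t] / (np.linalg.norm(K[t]) + 1e-8)  # unit-norm keys (assumed here)
        S, outs[t] = gated_delta_step(S, Q[t], k, V[t], alphas[t], betas[t])
    return outs, S

# Hypothetical usage: 50 tokens, key dim 2, value dim 3.
rng = np.random.default_rng(0)
T = 50
outs, S = gated_delta_scan(
    rng.normal(size=(T, 2)), rng.normal(size=(T, 2)), rng.normal(size=(T, 3)),
    alphas=np.full(T, 0.95), betas=np.full(T, 0.5),
)
print(outs.shape, S.shape)  # (50, 3) (3, 2)
```

The state `S` stays at `(d_v, d_k)` however long the sequence grows, whereas softmax attention must keep keys and values for every past token. That constant-size memory is the property the abstract refers to as memory-efficient long-context modeling.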
