NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
NextStep Team: Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, Hongyu Zhou, Kenkun Liu, Ailin Huang, Bin Wang, Changxin Miao, Deshan Sun, En Yu, Fukun Yin, Gang Yu, Hao Nie, Haoran Lv, Hanpeng Hu, Jia Wang, Jian Zhou

TL;DR
NextStep-1 introduces a large-scale autoregressive model that effectively combines discrete text tokens with continuous image tokens, achieving state-of-the-art results in text-to-image generation and editing tasks.
Contribution
The paper presents NextStep-1, a novel autoregressive model that integrates continuous image tokens with discrete text tokens, advancing image synthesis and editing capabilities.
Findings
State-of-the-art performance in text-to-image generation
Strong image editing capabilities
Effective combination of discrete and continuous tokens
Abstract
Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally-intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens with quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, training on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we will release our code and models to the community.
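To make the "flow matching head" in the abstract concrete, the following is a minimal sketch of how such a head could be trained and sampled for one continuous image token. It is not the authors' implementation. It assumes the common straight-path (rectified-flow style) formulation, which the abstract does not spell out. The names `fm_training_pair`, `euler_sample`, and `toy_velocity` are hypothetical, and `toy_velocity` is an analytic stand-in for the learned 157M head, with `h` playing the role of the conditioning signal from the autoregressive backbone.

```python
import numpy as np

def fm_training_pair(x1, rng):
    """Build one flow-matching training example for a clean token x1.

    Assumption: a straight path x_t = (1 - t) * x0 + t * x1 from noise x0 to data x1,
    whose velocity target is the constant x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)   # Gaussian noise sample
    t = rng.uniform()                    # time drawn uniformly from [0, 1)
    xt = (1.0 - t) * x0 + t * x1         # noisy point on the straight path
    target_v = x1 - x0                   # regression target for the head
    return xt, t, target_v

def euler_sample(velocity_fn, h, dim, num_steps, rng):
    """Sample one continuous token by Euler-integrating dx/dt = v(x, t, h) from t=0 to t=1."""
    x = rng.standard_normal(dim)         # start from pure noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t, h)
    return x

def toy_velocity(x, t, h):
    """Hypothetical stand-in for a trained head.

    This is the exact velocity field when the target for context h is the single
    point h itself; a real head would be a neural network conditioned on h.
    """
    return (h - x) / (1.0 - t)
```

In an autoregressive loop, a sampler like `euler_sample` would run once per image token, with `h` refreshed by the backbone at each position. That is why the number of sampling steps in the head multiplies directly into per-image latency.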
Peer Reviews
Decision: ICLR 2026 Oral
1. The paper is clearly written and technically sound, presenting a coherent and well-motivated method.
2. The ablation studies are particularly valuable; for instance, the analysis clarifying the relative contributions of the backbone versus the flow-matching head provides useful architectural insights.
3. Figures and tables are of high quality and effectively support the claims made in the text.
1. Several key technical details remain underspecified. For example, it is unclear whether the autoregressive process decodes one patch at a time and how global consistency is maintained if patches are stitched together.
2. The design of the tokenizer, including its normalization scheme, is described only textually; a schematic illustration would greatly improve clarity.
3. While the paper asserts that “reconstruction quality is the upper bound of generation quality,” it does not explicitly discuss
1. I really appreciate the research taste of the paper. The paper uses an extremely simple method to achieve state-of-the-art performance.
2. The paper offers many takeaways for subsequent researchers.
1. I strongly suggest that the authors hire someone adept at academic writing to rewrite the whole paper. The current paper is very unclear. For instance, the introduction is too short. The related work section is too short and does not fully respect prior authors. Given this, I have to lower my score to 4. This paper needs major revision.
2. Many parts are unclear. For instance, which GPUs were used in training? The pre-training and post-training sections are also
1. **SOTA Autoregressive Performance**: The paper's primary contribution is demonstrating, for the first time, that an AR model based on continuous tokens (NextStep-1) can achieve SOTA performance on T2I tasks, rivaling top-tier diffusion models. This architecture, combining a large AR Transformer for context prediction with a lightweight FM head for continuous token generation, is proven to be a very successful and promising technical direction.
2. **Deep Insights into Tokenizer and Latent Spa
1. **Inherent Bottleneck in Inference Latency**: The paper admits in Appendix D and Table A2 that inference latency is a major weakness. The AR sequential decoding is the first bottleneck, and the FM head's multi-step sampling is the second. A 1024-token image requiring 11.31 seconds of accumulated latency (Table A2) is likely far slower in practice than parallel diffusion models.
2. **Significant Challenges in High-Resolution Scaling**: The paper frankly states in Appendix D that the model fac
Code & Models
- 🤗 stepfun-ai/NextStep-1-Large (model) · 31 dl · ♡ 98
- 🤗 stepfun-ai/NextStep-1-Large-Edit (model) · 30 dl · ♡ 50
- 🤗 stepfun-ai/NextStep-1-Large-Pretrain (model) · 10 dl · ♡ 18
- 🤗 stepfun-ai/NextStep-1.1 (model) · 1.5k dl · ♡ 27
- 🤗 stepfun-ai/NextStep-1.1-Pretrain (model) · 17 dl · ♡ 7
- 🤗 stepfun-ai/NextStep-1.1-Pretrain-256px (model) · 51 dl · ♡ 13
