NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale

NextStep Team: Chunrui Han; Guopeng Li; Jingwei Wu; Quan Sun; Yan Cai; Yuang Peng; Zheng Ge; Deyu Zhou; Haomiao Tang; Hongyu Zhou; Kenkun Liu; Ailin Huang; Bin Wang; Changxin Miao; Deshan Sun; En Yu; Fukun Yin; Gang Yu; Hao Nie; Haoran Lv; Hanpeng Hu; Jia Wang; Jian Zhou; Jianjian Sun; Kaijun Tan; Kang An; Kangheng Lin; Liang Zhao; Mei Chen; Peng Xing; Rui Wang; Shiyu Liu; Shutao Xia; Tianhao You; Wei Ji; Xianfang Zeng; Xin Han; Xuelin Zhang; Yana Wei; Yanming Xu; Yimin Jiang; Yingming Wang; Yu Zhou; Yucheng Han; Ziyang Meng; Binxing Jiao; Daxin Jiang; Xiangyu Zhang; Yibo Zhu

arXiv:2508.10711·cs.CV·August 19, 2025


TL;DR

NextStep-1 is a large-scale autoregressive model that combines discrete text tokens with continuous image tokens, achieving state-of-the-art results among autoregressive models in text-to-image generation and strong performance in image editing.

Contribution

The paper presents NextStep-1, a novel autoregressive model that integrates continuous image tokens with discrete text tokens, advancing image synthesis and editing capabilities.

Findings

01

State-of-the-art performance among autoregressive models in text-to-image generation

02

Strong image editing capabilities

03

Effective combination of discrete and continuous tokens

Abstract

Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally-intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens with quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, training on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we will release our code and models to the community.
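The abstract pairs an autoregressive backbone with a flow matching head that produces each continuous image token. The sketch below illustrates one common way such a head can be trained and sampled; it is not taken from the paper. It assumes a straight-line (rectified-flow style) path between noise and data, a mean-squared-error velocity loss, and plain Euler integration. The names `fm_training_pair`, `fm_loss`, `euler_sample`, and `velocity_fn` are hypothetical, and `cond` stands in for whatever conditioning the backbone would supply.

```python
import numpy as np

def fm_training_pair(x1, rng):
    """Build one flow-matching training example for a continuous token x1.

    Assumption: linear path x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I).
    """
    x0 = rng.standard_normal(x1.shape)   # noise endpoint of the path
    t = rng.uniform()                    # time sampled in [0, 1)
    xt = (1.0 - t) * x0 + t * x1         # point on the straight path
    v_target = x1 - x0                   # velocity is constant along this path
    return xt, t, v_target

def fm_loss(v_pred, v_target):
    """Mean-squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))

def euler_sample(velocity_fn, cond, dim, rng, num_steps=20):
    """Generate one token by integrating dx/dt = velocity_fn(x, t, cond).

    velocity_fn is a hypothetical stand-in for a learned head; cond is a
    stand-in for the backbone's output at the current position.
    """
    x = rng.standard_normal(dim)         # start from noise at t = 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = np.array([1.0, -2.0, 0.5, 3.0])
    # Toy oracle: for a single fixed target, the ideal velocity at (x, t)
    # points straight at the target with magnitude (target - x) / (1 - t).
    oracle = lambda x, t, c: (c - x) / (1.0 - t)
    print(euler_sample(oracle, target, 4, rng))
```

In an autoregressive loop, a sampler like this would run once per image token, which is why the per-token step count of the head matters for total latency.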

Peer Reviews

Decision·ICLR 2026 Oral

Reviewer 01·Rating 4·Confidence 4

Strengths

1. The paper is clearly written and technically sound, presenting a coherent and well-motivated method. 2. The ablation studies are particularly valuable; for instance, the analysis clarifying the relative contributions of the backbone versus the flow-matching head provides useful architectural insights. 3. Figures and tables are of high quality and effectively support the claims made in the text.

Weaknesses

1. Several key technical details remain underspecified. For example, it is unclear whether the autoregressive process decodes one patch at a time and how global consistency is maintained if patches are stitched together. 2. The design of the tokenizer, including its normalization scheme, is described only textually; a schematic illustration would greatly improve clarity. 3. While the paper asserts that “reconstruction quality is the upper bound of generation quality,” it does not explicitly discuss

Reviewer 02·Rating 2·Confidence 5

Strengths

1. I really appreciate the research taste of the paper. The paper uses an extremely simple method to achieve state-of-the-art performance. 2. The paper offers many takeaways for subsequent researchers.

Weaknesses

1. I strongly suggest that the authors have someone adept at academic writing rewrite the whole paper. The current paper is very unclear. For instance, the introduction is too short, and the related work section is too short to give proper credit to prior work. Given this, I have to lower my score to 4. This paper needs major revision. 2. Many parts are unclear. For instance, what GPUs were used in training? The pre-training and post-training sections are also

Reviewer 03·Rating 6·Confidence 4

Strengths

1. **SOTA Autoregressive Performance**: The paper's primary contribution is demonstrating, for the first time, that an AR model based on continuous tokens (NextStep-1) can achieve SOTA performance on T2I tasks, rivaling top-tier diffusion models. This architecture, combining a large AR Transformer for context prediction with a lightweight FM head for continuous token generation, is proven to be a very successful and promising technical direction. 2. **Deep Insights into Tokenizer and Latent Spa

Weaknesses

1. **Inherent Bottleneck in Inference Latency**: The paper admits in Appendix D and Table A2 that inference latency is a major weakness. The AR sequential decoding is the first bottleneck, and the FM head's multi-step sampling is the second. A 1024-token image requiring 11.31 seconds of accumulated latency (Table A2) is likely far slower in practice than parallel diffusion models. 2. **Significant Challenges in High-Resolution Scaling**: The paper frankly states in Appendix D that the model fac
