Autoregressive Models Rival Diffusion Models at ANY-ORDER Generation
Tianqi Du, Lizhe Fang, Weijie Yang, Chenheng Zhang, Zeming Wei, Yifei Wang, Yisen Wang

TL;DR
This paper introduces A3, a generalized autoregressive framework that extends standard AR models to arbitrary token groups and orders, combining their strengths with diffusion models' flexibility for improved language generation.
Contribution
A3 reformulates autoregressive modeling into a structured multi-group prediction process, enabling flexible, parallel, and bidirectional generation while maintaining probabilistic rigor.
Findings
A3 outperforms diffusion models in question answering, reasoning, and story infilling tasks.
A3 maintains flexible decoding with improved sample quality and stability.
The framework effectively transitions pretrained AR models to any-order prediction.
Abstract
Diffusion language models enable any-order generation and bidirectional conditioning, offering appealing flexibility for tasks such as infilling, rewriting, and self-correction. However, their formulation (predicting one part of a sequence from another within a single-step dependency) limits modeling depth and often yields lower sample quality and stability than autoregressive (AR) models. To address this, we revisit autoregressive modeling as a foundation and reformulate diffusion-style training into a structured multi-group prediction process. We propose Any-order Any-subset Autoregressive modeling (A3), a generalized framework that extends the standard AR factorization to arbitrary token groups and generation orders. A3 preserves the probabilistic rigor and multi-layer dependency modeling of AR while inheriting diffusion models' flexibility for parallel and bidirectional generation. We…
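The idea of extending the AR factorization to arbitrary token groups and orders can be illustrated on a toy example. The sketch below is not the paper's implementation: it uses a hypothetical joint distribution over three binary tokens (the weights are made up) and only checks the exact chain rule, i.e. that multiplying group conditionals p(x_G | previously generated groups) recovers the same joint probability for any ordered partition of positions. All function names are illustrative assumptions. A trained model would approximate each group conditional with a network, which this sketch does not attempt.

```python
from itertools import product

# Hypothetical toy joint over 3 binary tokens; the weights are arbitrary
# illustration values, normalized so the probabilities sum to 1.
VOCAB = (0, 1)
weights = [1, 2, 3, 4, 5, 6, 7, 8]
total = sum(weights)
JOINT = {seq: w / total for seq, w in zip(product(VOCAB, repeat=3), weights)}


def marginal(assignment):
    """Probability that each position in `assignment` (pos -> token) takes its value."""
    return sum(
        p for seq, p in JOINT.items()
        if all(seq[i] == tok for i, tok in assignment.items())
    )


def group_conditional(seq, group, context):
    """Exact p(x_group | x_context), evaluated at the tokens found in `seq`."""
    ctx = {i: seq[i] for i in context}
    both = {**ctx, **{i: seq[i] for i in group}}
    return marginal(both) / marginal(ctx)


def groupwise_prob(seq, groups):
    """Chain rule over an ordered list of position groups."""
    prob, seen = 1.0, []
    for group in groups:
        prob *= group_conditional(seq, group, seen)
        seen.extend(group)
    return prob


seq = (1, 0, 1)
# Standard AR: one token per group, left to right.
left_to_right = groupwise_prob(seq, [[0], [1], [2]])
# Any-order, any-subset: last token first, then a two-token group.
any_order = groupwise_prob(seq, [[2], [0, 1]])
```

Both `left_to_right` and `any_order` equal `JOINT[seq]` up to floating-point error, since the chain rule holds for every ordered grouping. The modeling question is how well a single network can represent all of these group conditionals at once.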
Peer Reviews
Decision · ICLR 2026 Poster
1. Insightful formulation. The paper presents a very interesting perspective on any-order, any-subset autoregressive modeling, effectively bridging the strengths of AR and masked diffusion models in a unified probabilistic framework. 2. Clarity and organization. The paper is well written and easy to follow, with clear explanations, sound motivation, and well-structured methodology. 3. Experiments across multiple reasoning and generation benchmarks validate the effectiveness of the proposed app
1. What's new relative to prior AR generalizations? The authors mention that the proposed method builds closely on existing ideas from XLNet (permutation-based AR) and masked diffusion modeling, with the main difference being a unified training/inference view. While conceptually elegant, the contribution may feel incremental rather than fundamentally new. 2. Evaluation scope and ablations are limited. Experiments mainly compare against diffusion-style baselines; there is less analysis against
1. The paper focuses on a critical problem: it addresses a gap in sequence generation, namely reconciling AR’s stability and quality with diffusion’s flexibility and parallelism, which is aligned with real-world needs (long-context generation, infilling). 2. It explicitly outperforms AR models in infilling (ROCStories) by utilizing bidirectional context, validating its flexibility.
1. Fails to cite or discuss existing NAR research on Block-wise generation (e.g., Block-AR models that split sequences into fixed/masked blocks for parallel prediction) and progressive training (e.g., curriculum-based NAR training that increments block size or relaxes order constraints). This gap obscures A3’s incremental innovation—readers cannot distinguish whether A3’s groupwise/progressive designs are novel or iterative improvements on prior NAR work. 2. No controlled experiment comparing A
1. Clear target: AR-level stability + arbitrary-order generation. The paper correctly identifies a real and currently active gap: classic AR models are stable and likelihood-faithful but order-rigid, while discrete diffusion / iterative masked LM are order-flexible but multi-step, hyperparameter-sensitive, and sometimes harder to train. Proposing a single framework that “looks like AR to the optimizer” but “behaves like an infiller” is a sensible and timely goal. 2. Groupwise factorization is a
1. Parallel generation is only *heuristically* correct, not *distributionally* correct. The paper’s decoding story (“decode some groups, resample the uncertain ones”) is an engineering compromise, but it does not give the kind of provable, joint-distribution-correct parallel sampling that very recent any-subset AR work is starting to provide (e.g. ASSD in Guo & Ermon 2025) — those works explicitly address the mismatch between parallel predictions and the target joint, while A3 largely sidesteps
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Hate Speech and Cyberbullying Detection
