Any-Order GPT as Masked Diffusion Model: Decoupling Formulation and Architecture
Shuchen Xue, Tianyu Xie, Tianyang Hu, Zijin Feng, Jiacheng Sun, Kenji Kawaguchi, Zhenguo Li, Zhi-Ming Ma

TL;DR
This paper compares masked diffusion models and autoregressive models within a unified decoder-only framework, revealing architectural trade-offs and proposing refinements to the permutation-based training objective for large language models.
Contribution
It decouples the effects of modeling paradigm and architecture in language models, providing a fair comparison and insights into design trade-offs.
Findings
Decoder-only MDM generates roughly 25x faster than encoder-only MDM.
Permutation averaging in AO-AR may be less informative than the language's natural order.
Decoder-only MDM attains comparable perplexity with temperature annealing.
Abstract
Large language models (LLMs) predominantly use autoregressive (AR) approaches, but masked diffusion models (MDMs) are emerging as viable alternatives. A key challenge in comparing AR and MDM paradigms is their typical architectural difference: AR models are often decoder-only, while MDMs have largely been encoder-only. This practice of changing both the modeling paradigm and architecture simultaneously makes direct comparisons unfair, as it's hard to distinguish whether observed differences stem from the paradigm itself or the architectural shift. This research evaluates MDMs within a decoder-only framework to: (1) equitably compare MDM (as Any-Order AR, or AO-AR) and standard AR paradigms. Our investigation suggests that the standard AO-AR objective, which averages over all token permutations, may benefit from refinement, as many permutations appear less informative compared to the…
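For readers unfamiliar with the any-order objective the abstract refers to, it is commonly written as an expectation over uniformly sampled token orders. The form below is the standard textbook statement, not copied from the paper; the symbols ($\sigma$ for a permutation of the $n$ positions, $S_n$ for the set of all such permutations, $p_\theta$ for the model) are assumed notation.

```latex
\mathcal{L}_{\text{AO-AR}}(\theta)
  = \mathbb{E}_{x \sim p_{\text{data}}}\;
    \mathbb{E}_{\sigma \sim \mathcal{U}(S_n)}
    \left[ -\sum_{t=1}^{n} \log p_\theta\!\left( x_{\sigma(t)} \mid x_{\sigma(<t)} \right) \right]
```

Standard left-to-right AR is the special case in which $\sigma$ is fixed to the identity permutation, so the two paradigms differ only in the distribution over orders.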
Peer Reviews
Decision: Submitted to ICLR 2026
++ By keeping the backbone decoder‑only and varying only the order/formulation, the paper cleanly separates effects of AO‑AR/MDM vs. AR. Section 2.2 crisply contrasts training signal density, density‑estimation, and generation complexity (O(n) for decoder‑only with KV cache vs. ~O(T·n) for encoder‑only MDM), avoiding the usual apples‑to‑oranges comparisons. ++ The work unifies masked‑diffusion learning and any‑order AR by showing L_MDM ≡ L_AO‑AR, anchoring later design choices and analyses. This
-- Results are primarily at GPT‑2 small/medium scale; claims about competitiveness would be more convincing at ≥1B parameters and on stronger reasoning/benchmarks beyond perplexity. -- The ~25× speedup is promising but depends on decoder‑only specifics; head‑to‑head latency/throughput vs. tuned AR (Flash‑/paged‑KV, speculative decoding) and vs. well‑optimized encoder‑only MDMs (varying T) under identical hardware and sequence lengths would strengthen the claim. -- It’s unclear whether AO‑GPT a
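The generation-complexity contrast cited in this review (O(n) for decoder-only with a KV cache vs. roughly O(T·n) for encoder-only MDM) can be illustrated by counting how many token positions pass through the network. This is a minimal sketch under simplifying assumptions: the function names and the example values of `n` and `T` are hypothetical, and the count ignores per-token attention cost, so it does not reproduce the reported ~25x wall-clock figure.

```python
def decoder_only_token_evals(n: int) -> int:
    """Token positions fed through the network to generate n tokens with a KV cache.

    Each step processes only the newly generated token; keys and values
    of earlier tokens are reused from the cache.
    """
    return n


def encoder_only_mdm_token_evals(n: int, num_steps: int) -> int:
    """Token positions processed by an encoder-only MDM over num_steps denoising steps.

    Bidirectional attention prevents caching, so every step re-encodes
    all n positions.
    """
    return num_steps * n


# Hypothetical sizes, chosen only for illustration.
n, T = 1024, 64
ratio = encoder_only_mdm_token_evals(n, T) / decoder_only_token_evals(n)
print(f"encoder-only / decoder-only token evaluations: {ratio:.0f}x")
```

Under this simple count the ratio equals the number of denoising steps T, which is why the reviewer asks for comparisons across varying T.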
1. Well-posed problem statement. The paper identifies a real confounder in current comparisons: AR↔MDM and decoder-only↔encoder-only are almost always changed together, so we don’t know which factor is responsible for the gap. Making MDM/AO run on a GPT-style decoder is a clean way to decouple these effects. This is genuinely useful for the community. 2. Concrete, nontrivial engineering recipe. The combination “any-order objective + per-layer target-position injection (adaLN) + very slow EMA + 1
1. “Fair comparison” is only partially fair. The central claim is “we decouple formulation and architecture,” but the decoder-only any-order model gets a custom training recipe (adaLN, EMA=0.9999, 10% L2R mixing) that is not re-applied and re-tuned for the encoder-only diffusion baselines it is compared against. Yet we know from MDLM and EDLM that diffusion LMs are very sensitive to the exact denoising/weighting schedule and to Rao-Blackwellization tricks. If you let the baselines also adopt “a
(1). This work clearly and rigorously shows that the MDM loss function is mathematically equivalent to the AO-AR loss. This is important for putting AR and MDM on a common theoretical footing, isolating the true source of differences to the token-order distribution rather than architecture, and enabling apples-to-apples empirical comparisons that inform practical choices like order mixing, annealing, and ensembling. Moreover, the equivalence between the efficient sampling algorithm and Eq. (8) a
(1). The major concern is the scalability of the findings elaborated in this work. This work only tests a small model with 350M parameters. It is unclear whether findings such as the observed convergence behavior and order-mixing benefits hold true for larger models. Although the manuscript acknowledges this, it does not provide confirming evidence, which is a non-negligible weakness. (2). This work only tests on perplexity, with limited coverage of downstream conditional tasks lik
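Two ingredients of the training recipe named in the reviews, order mixing with 10% left-to-right (L2R) orders and a slow EMA with decay 0.9999, can be sketched as follows. This is an illustrative sketch, not the authors' implementation: `sample_order` and `ema_update` are hypothetical helper names, and reading "10% L2R mixing" as "use the left-to-right order with probability 0.1, otherwise a uniform random permutation" is an assumption.

```python
import random


def sample_order(n, l2r_prob=0.1, rng=random):
    """Return a generation order over positions 0..n-1.

    Assumption: with probability l2r_prob the natural left-to-right
    order is used; otherwise a uniformly random permutation is drawn.
    """
    order = list(range(n))
    if rng.random() >= l2r_prob:
        rng.shuffle(order)
    return order


def ema_update(ema_params, params, decay=0.9999):
    """Update EMA weights in place: ema <- decay * ema + (1 - decay) * param.

    Parameters are plain floats keyed by name here; a real model would
    apply the same rule tensor by tensor.
    """
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params


# Usage: draw one order for an 8-token sequence and take one EMA step.
rng = random.Random(0)
order = sample_order(8, rng=rng)
ema = ema_update({"w": 1.0}, {"w": 0.0})
```

With a decay of 0.9999, a single step moves the averaged weights only 0.01% of the way toward the current weights, which is what the reviews mean by a very slow EMA.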
Taxonomy
Topics: Stochastic processes and financial applications
Methods: Diffusion
