Planned Diffusion
Daniel Israel, Tian Jin, Ellie Cheng, Guy Van den Broeck, Aditya Grover, Suvinay Subramanian, Michael Carbin

TL;DR
Planned diffusion lets a large language model determine its own denoising order: the model first generates a plan autoregressively, then fills in the planned chunks with parallel diffusion, which speeds up generation with minimal quality loss.
Contribution
The paper introduces planned diffusion, a novel system where models learn to decide their own denoising order, combining autoregressive planning with parallel diffusion for efficient generation.
Findings
Achieves 1.27x to 1.81x speedup over autoregressive generation.
Maintains high instruction-following quality with minimal drop in win rate.
Provides tunable runtime controls for quality-latency trade-offs.
Abstract
Most large language models are autoregressive: they generate tokens one at a time. Discrete diffusion language models can generate multiple tokens in parallel, but sampling from them requires a denoising order: a strategy for deciding which tokens to decode at each step. Determining a good denoising order is difficult, and existing approaches use heuristics that create a steep trade-off between quality and latency. We propose planned diffusion, a system that trains the model to determine its own denoising order. Planned diffusion uses a single model that transitions between autoregressive and diffusion-based generation: first, the model autoregressively generates a plan that partitions the response into semantically independent chunks; second, the model denoises all chunks in parallel. The autoregressive plan enables the model to define the denoising order itself. On AlpacaEval, planned…
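The two-stage structure described in the abstract can be illustrated with a toy count of sequential decoding steps. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Chunk` type, the `plan_tokens` argument, and the `steps_per_token` knob are hypothetical names, and the sketch assumes that all chunks are denoised in parallel so that the longest chunk sets the critical path. It counts sequential steps only, so its ratios are not comparable to the wall-clock speedups reported above.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    topic: str   # hypothetical label for the chunk, as a plan might name it
    length: int  # hypothetical predicted token count for the chunk


def sequential_steps(plan_tokens: int, chunks: list[Chunk]) -> int:
    """Steps if every token (plan and chunks) is decoded one at a time."""
    return plan_tokens + sum(c.length for c in chunks)


def planned_steps(plan_tokens: int, chunks: list[Chunk],
                  steps_per_token: float = 1.0) -> int:
    """Steps if the plan is decoded one token at a time and all chunks
    are then denoised in parallel (assumption: the longest chunk
    determines how many denoising steps are needed)."""
    longest = max((c.length for c in chunks), default=0)
    return plan_tokens + math.ceil(longest * steps_per_token)


chunks = [Chunk("definition", 40), Chunk("example", 40), Chunk("caveats", 20)]
baseline = sequential_steps(20, chunks)
planned = planned_steps(20, chunks)
print(baseline, planned, baseline / planned)
```

Lowering `steps_per_token` below 1.0 in this toy model stands in for taking fewer denoising steps than tokens, which is one way a quality-latency control could be exposed.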
Peer Reviews
Decision: ICLR 2026 Poster
1. Well-motivated: a very relevant problem, and one that addresses a key weakness in diffusion language models.
2. Novelty: combining an autoregressive planning stage with a diffusion-based parallel generation stage within a single unified model.
3. Implementation: Proposes a reasonable set of methods, including a new control tag language, a model training methodology, and an inference algorithm, that enable planned diffusion and navigation of a Pareto frontier between speed and performance.
1. Evaluation Scope: Evaluation is only on AlpacaEval and lacks any other benchmarks, tasks, or domains.
2. Baselines: There is only one baseline beyond the vanilla autoregressive and diffusion LLM baselines.
3. Complexity: Quite a lot of complexity without a full ablation to justify each design choice.
4. Trade-off: A performance loss of 6.8% is still pretty substantial, and it is not clear how much speed-up one could get with, say, a smaller model or speculative decoding.
- Clear, appealing idea: formal two-stage factorization (planning then parallel diffusion), with an explicit algorithm and attention-masking design.
- Well-specified control language (<topic>, <async>, <sync/>) that makes semantic parallelism concrete and implementable.
- Empirical evidence of a new speed/quality trade-off vs. AR and diffusion baselines (latency–quality plots, critical-path analysis, scaling behavior).
- Sensitivity analyses help demystify behavior: best performance when using
- Benchmark scope: Results focus on AlpacaEval with an LLM-as-judge (LCWR). This is a useful proxy but not a robust test of coherence/faithfulness across diverse tasks (e.g., reasoning, long-form, safety). Lack of human evals or broader benchmarks (e.g., MT-Bench, GSM-8K reasoning slices, instruction-following suites) weakens generality.
- Baselines & fairness details: Diffusion is configured with steps equal to token count, and fast-dLLM with specific hyperparameters; however, broader ablations
1. First text-only model combining discrete diffusion with autoregression in a unified architecture, addressing the speed-quality tradeoff from a novel angle
2. Hybrid attention masking elegantly enables both causal and bidirectional attention; KV caching strategy is well-designed for this architecture
3. Establishes new Pareto frontier point; sensitivity analysis confirms model learns accurate length prediction without systematic bias
4. Method is orthogonal to other acceleration techniques and
1. Only AlpacaEval benchmark; no evaluation on diverse tasks (summarization, QA, code generation, creative writing). How does performance vary across task types?
2. No direct comparison to other semantic parallelism methods (e.g., Skeleton-of-Thought, APAR, ParaThinker) despite extensive related work discussion. This is critical for establishing true contribution.
3. Relies on Gemini for training data annotation. What is annotation quality? How many examples were rejected? Could this be learned
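The reviews mention a hybrid attention mask that supports both causal and bidirectional attention. One way such a mask could be laid out is sketched below. This is an illustration under assumptions, not the paper's actual design: it assumes plan tokens attend causally, tokens inside a chunk attend bidirectionally to each other and to the full plan, and chunks do not attend to one another. The function name `hybrid_mask` and its arguments are hypothetical.

```python
def hybrid_mask(plan_len: int, chunk_lens: list[int]) -> list[list[bool]]:
    """Build a boolean mask where mask[i][j] is True if position i may
    attend to position j. Layout: plan tokens first, then each chunk."""
    n = plan_len + sum(chunk_lens)
    mask = [[False] * n for _ in range(n)]

    # Plan tokens: causal attention (each token sees itself and earlier ones).
    for i in range(plan_len):
        for j in range(i + 1):
            mask[i][j] = True

    # Chunk tokens: see the whole plan plus every token in their own chunk
    # (assumption: chunks are independent, so no cross-chunk attention).
    start = plan_len
    for length in chunk_lens:
        end = start + length
        for i in range(start, end):
            for j in range(plan_len):
                mask[i][j] = True
            for j in range(start, end):
                mask[i][j] = True
        start = end
    return mask


# Two plan tokens, then chunks of length 2 and 1.
for row in hybrid_mask(2, [2, 1]):
    print("".join("x" if allowed else "." for allowed in row))
```

Printing the mask as a grid makes the block structure visible: a lower triangle for the plan, followed by one dense square per chunk.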
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Natural Language Processing Techniques
