TL;DR
This paper introduces Blockwise SFT, a training method that aligns supervised fine-tuning with the blockwise decoding process of diffusion language models, leading to improved performance on mathematical reasoning tasks.
Contribution
It proposes Blockwise SFT, which partitions responses into fixed-size blocks so that training matches blockwise inference, and demonstrates gains over classical SFT on GSM8K, MATH, and MetaMathQA.
Findings
Consistent performance gains over classical SFT on GSM8K, MATH, and MetaMathQA under equal compute or token budgets.
Improvements are due to better training-inference alignment, not incidental masking effects.
Study confirms the importance of matching supervision granularity to decoding procedures.
Abstract
Discrete diffusion language models have shown strong potential for text generation, yet standard supervised fine-tuning (SFT) misaligns with their semi-autoregressive inference: training randomly masks tokens across the entire response, while inference generates fixed-size blocks sequentially. This mismatch introduces noisy prefixes and leaky suffixes, biasing gradients away from the desired blockwise likelihood. We propose Blockwise SFT, which partitions responses into fixed-size blocks, selects one active block per step for stochastic masking, freezes all preceding tokens, and fully hides future ones. Loss is computed only over the active block, directly mirroring the blockwise decoding process. Experiments on GSM8K, MATH, and MetaMathQA show consistent gains over classical SFT under equal compute or token budgets. Block size consistency studies and ablations confirm that improvements…
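The masking scheme described in the abstract can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function name `blockwise_sft_view`, the `<mask>` symbol, and the token-list representation are assumptions made for the example, and the prompt (which would stay unmasked) is omitted. The abstract states that loss is computed over the active block; the sketch returns both the active positions and the masked subset so that either reading can be applied downstream.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the paper


def blockwise_sft_view(response, block_size, active_block, mask_prob, rng):
    """Corrupt `response` so it resembles what a blockwise decoder sees.

    Returns (corrupted, active_positions, masked_positions), where positions
    are indices into `response`.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    n_blocks = (len(response) + block_size - 1) // block_size
    if not 0 <= active_block < n_blocks:
        raise ValueError("active_block out of range")

    start = active_block * block_size
    end = min(start + block_size, len(response))

    corrupted, masked_positions = [], []
    for i, tok in enumerate(response):
        if i < start:
            corrupted.append(tok)        # preceding blocks: kept clean (frozen)
        elif i >= end:
            corrupted.append(MASK)       # future blocks: fully hidden
        elif rng.random() < mask_prob:
            corrupted.append(MASK)       # active block: stochastically masked
            masked_positions.append(i)
        else:
            corrupted.append(tok)        # active block: left visible
    return corrupted, list(range(start, end)), masked_positions


# Toy usage: 12 tokens, block size 4, second block (index 1) active.
response = "so 3 + 4 = 7 and 7 * 2 = 14".split()
rng = random.Random(0)
corrupted, active, masked = blockwise_sft_view(response, 4, 1, 0.5, rng)
print(corrupted)
print("active:", active, "masked:", masked)
```

In a real training loop the active block index and the masking rate would be sampled per step, and the loss would be restricted to positions drawn from `active` (or from `masked`, if only masked tokens are supervised).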
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper nails a real but often-overlooked problem: standard SFT trains on full sequences with bidirectional context, but blockwise inference sees only a clean prefix and hides the future. The proposed Blockwise SFT fixes this by supervising only the active block with a clean prefix and hidden suffix. The theoretical contributions are solid—gradient-bias analysis (Theorem 3.1), a variational upper bound matching the inference factorization (Theorem 3.2), and an unbiased single-block estimator
1. Loss inside the active block (Eq. (5)). The paper defines
$$ \tilde{\mathcal{L}}_t(\theta; \mathbf{x}, a) = -\sum_{i\in\mathcal{I}_a} \log p_\theta(x_i \mid \mathbf{z}_t, t), $$
yet Algorithm 1 samples an intra-block mask ($m_i\sim\mathrm{Bernoulli}(\pi)$) and states that loss is "only on the active block." In discrete diffusion / masked-LM style training, loss is normally computed only on masked positions; otherwise, unmasked tokens create an identity path and can yield dege
- **Clear motivation & strong alignment insight:** The paper precisely diagnoses the training–inference mismatch in diffusion LMs and proposes a clean fix grounded in the decoding procedure.
- **Theory–practice coherence:** Provides formal bias analysis, variational bound, and unbiased gradient theorem; yet remains simple to implement (mask-only change).
- **Empirical rigor:** Demonstrates consistent improvements across reasoning datasets, with ablations (block size, prefix noise, suffix lea
## Major
- **Limited architectural coverage.** All experiments use only **LLaDA-8B-Instruct** with LoRA fine-tuning. The claim of being “architecture-agnostic” is unsupported without testing on other diffusion backbones such as **BlockDiffusion**, **APD**, or **RDM**.
- **Scope restricted to reasoning tasks.** Evaluation is limited to GSM8K, MATH, and MetaMathQA, leaving uncertainty about performance on open-ended or dialogue data.
- **Lack of scaling analysis.** The paper does not study beh
The paper is clearly written and well-organized. The methodology and theoretical sections are presented with both formal rigor and intuitive explanations. The algorithm and the diagrams make the proposed training recipe easy to follow, which enhances readability.
- Limited novelty: The proposed Blockwise SFT can be viewed as a straightforward adaptation of existing ideas for aligning training objectives with blockwise decoding. While the paper frames the problem clearly, the methodological innovation appears incremental.
- Training inefficiency: Since each training step only supervises one active block, the number of effective supervised tokens per update is significantly lower than in standard SFT under the same FLOP budget.
- Missing baselines: The e
