DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation

Makoto Shing; Masanori Koyama; Takuya Akiba

arXiv:2506.14202 · cs.LG · February 19, 2026

Open Access · 3 Reviews

TL;DR

DiffusionBlocks introduces a theoretically grounded block-wise training framework for transformer networks, reducing memory bottlenecks while maintaining performance across diverse architectures and tasks.

Contribution

It transforms transformer networks into independent trainable blocks using a diffusion interpretation, enabling scalable training with minimal modifications.

Findings

01

Matches end-to-end training performance

02

Reduces memory requirements in proportion to the number of blocks

03

Effective across various transformer architectures
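
The memory finding can be illustrated with back-of-the-envelope arithmetic. The sketch below assumes activation memory grows linearly with the number of layers that gradients flow through, and that only one block is trained at a time. The function name `activation_memory_gb` and the figures (24 layers, 0.5 GB per layer, 4 blocks) are hypothetical and not taken from the paper; parameters, optimizer state, and embeddings are ignored.

```python
def activation_memory_gb(num_layers, per_layer_gb, num_blocks=1):
    """Rough activation memory when gradients flow through one block at a time.

    Assumes equal-sized blocks and a fixed (hypothetical) activation cost per layer.
    """
    if num_layers % num_blocks != 0:
        raise ValueError("num_layers must be divisible by num_blocks in this sketch")
    layers_per_block = num_layers // num_blocks
    return layers_per_block * per_layer_gb


# Hypothetical model: 24 layers, 0.5 GB of stored activations per layer.
end_to_end = activation_memory_gb(num_layers=24, per_layer_gb=0.5)
blockwise = activation_memory_gb(num_layers=24, per_layer_gb=0.5, num_blocks=4)
print(end_to_end, blockwise, end_to_end / blockwise)
```

Under these assumptions the saving factor equals the number of blocks; real savings depend on what else occupies memory during training.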

Abstract

End-to-end backpropagation requires storing activations throughout all layers, creating memory bottlenecks that limit model scalability. Existing block-wise training methods offer means to alleviate this problem, but they rely on ad-hoc local objectives and remain largely unexplored beyond classification tasks. We propose DiffusionBlocks, a principled framework for transforming transformer-based networks into genuinely independent trainable blocks that maintain competitive performance with end-to-end training. Our key insight leverages the fact that residual connections naturally correspond to updates in a dynamical system. With minimal modifications to this system, we can convert the updates to those of a denoising process, where each block can be learned independently by leveraging the score matching objective. This independence enables training with gradients for only one…
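
To make the abstract's idea concrete, here is a toy sketch in pure Python. It is not the authors' implementation. It assumes a noising model $z = y + \sigma\epsilon$, one fixed noise level per block, scalar data, and a linear denoiser $D(z) = az + b$ per block. Each block is fit on its own denoising objective (closed-form least squares stands in for gradient training), and sampling chains the blocks with a residual-style Euler update $z \leftarrow z + (\sigma_{\text{next}} - \sigma)\,(z - D(z))/\sigma$. All names (`noise_levels`, `train_block`, `sample`) and the geometric noise schedule are illustrative assumptions.

```python
import random


def noise_levels(num_blocks, sigma_max=10.0, sigma_min=0.1):
    """Geometric noise schedule, one level per block (an assumed choice)."""
    ratio = (sigma_min / sigma_max) ** (1.0 / (num_blocks - 1))
    return [sigma_max * ratio ** i for i in range(num_blocks)]


def train_block(sigma, data, rng, num_pairs=5000):
    """Fit D(z) = a*z + b minimising the mean of (D(y + sigma*eps) - y)^2.

    Only clean data and this block's own noise level are used, so no
    other block's parameters or activations are needed.
    """
    ys = [rng.choice(data) for _ in range(num_pairs)]
    zs = [y + sigma * rng.gauss(0.0, 1.0) for y in ys]
    mean_z = sum(zs) / num_pairs
    mean_y = sum(ys) / num_pairs
    cov_zy = sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys)) / num_pairs
    var_z = sum((z - mean_z) ** 2 for z in zs) / num_pairs
    a = cov_zy / var_z           # least-squares slope
    b = mean_y - a * mean_z      # least-squares intercept
    return a, b


def sample(blocks, sigmas, rng):
    """Start from noise and apply one residual-style Euler step per block."""
    z = sigmas[0] * rng.gauss(0.0, 1.0)
    next_sigmas = sigmas[1:] + [0.0]  # the last block steps down to zero noise
    for (a, b), sigma, sigma_next in zip(blocks, sigmas, next_sigmas):
        denoised = a * z + b
        z = z + (sigma_next - sigma) * (z - denoised) / sigma
    return z


rng = random.Random(0)
data = [rng.gauss(2.0, 0.5) for _ in range(2000)]          # toy "dataset"
sigmas = noise_levels(num_blocks=4)
blocks = [train_block(s, data, rng) for s in sigmas]       # independent fits
samples = [sample(blocks, sigmas, rng) for _ in range(2000)]
sample_mean = sum(samples) / len(samples)
print(sample_mean)
```

Because `train_block` never reads another block's parameters, the blocks could be fit in any order or on separate devices; only sampling needs them together. With so few Euler steps the samples only roughly match the data distribution.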

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The method looks simple and easy to implement.
2. The method targets an important problem: scalability and training/inference speed of large models.
3. For the most part, the paper is well written and easy to read.

Weaknesses

1. However, the notation sometimes lacks clarity; for example, it is unclear what $(x,y)$ are. From Section 2.1, it appears that $y$ is data (e.g. an image). That would supposedly make $x$ a label; however, in Fig. 3, $z_0\leftarrow x$, which suggests data rather than a label. In Figure 2, in the case of the classifier, it looks like $y$ is actually a label. A simple fix would be to clearly specify what $x$ and $y$ are and keep the notation consistent throughout the entire paper.
2. The resul

Reviewer 02 · Rating 8 · Confidence 3

Strengths

S1. The method seems very novel. Converting these disparate tasks (classification, image generation, text generation) into denoising tasks and separating the forward/backprop for the network into different groups of layers (blocks) seems to solve a problem that prior block-wise training methods could not.
S2. Under the specified setups, the block-wise training matches or beats the baseline.

Weaknesses

W1. On practical terms, this method would seem to have no advantage over recompute/checkpointing. In fact, the memory savings from recompute are substantially higher.
W2. It is not clear how this affects training efficiency (wall time).
W3. Baselines are odd. CIFAR-100 is a little toy-ish, and the default ViT performs quite poorly on it. The DiT is also under-optimized, and I assume this applies to other things as well. So it's not actually clear that this method can match any end-to-end trai

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- The main idea of this paper is novel (concurrent to NoProp). It tries to convert the forward propagation process of residual-based architectures to a diffusion denoising process.
- The method is evaluated on numerous tasks (image + text classification and generation).
- There is some theoretical analysis of the partitioning approach.

Weaknesses

- In Table 2, the number of training epochs/iterations is not reported. For DiT-L/2 on ImageNet 256x256, what are the FIDs with and without classifier-free guidance (CFG)? The current result does not state whether CFG is used.
- To my understanding, Equation 4 should only denote one residual connection. In ViT, one residual connection is for the self-attention operation and one residual connection is for the MLP. To be rigorous with the theory, each denoising block should denote a single resid

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Machine Learning in Healthcare

Methods: Diffusion