Layer-Parallel Training for Transformers
Shuai Jiang, Marc Salvadó-Benasco, Eric C. Cyr, Alena Kopaničáková, Rolf Krause, and Jacob B. Schroder

TL;DR
This paper introduces a layer-parallel training method for transformers using a neural ODE formulation, enabling scalable parallel training for large models while maintaining accuracy.
Contribution
It develops a multilevel parallel-in-time algorithm for transformer training, addressing bias issues and demonstrating effective acceleration on large models.
Findings
Achieves parallel acceleration over the layer dimension.
Maintains accuracy comparable to serial training.
Effective on models such as BERT, GPT2, and ViT.
Abstract
We present a new training methodology for transformers using a multilevel, layer-parallel approach. Through a neural ODE formulation of transformers, our application of a multilevel parallel-in-time algorithm for the forward and backpropagation phases of training achieves parallel acceleration over the layer dimension. This dramatically enhances parallel scalability as the network depth increases, which is particularly useful for increasingly large foundational models. However, achieving this introduces errors that cause systematic bias in the gradients, which in turn reduces convergence when closer to the minima. We develop an algorithm to detect this critical transition and either switch to serial training or systematically increase the accuracy of layer-parallel training. Results, including BERT, GPT2, ViT, and machine translation architectures, demonstrate parallel-acceleration as…
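The idea of parallelizing over layers can be illustrated with a toy sketch. It is not the authors' implementation: it uses a parareal-style two-level iteration (a simpler relative of the multilevel parallel-in-time method the abstract names), a scalar `tanh` layer as a stand-in for a transformer block, and hypothetical names throughout (`layer`, `fine_block`, `coarse_block`, `two_level_forward`). Each residual layer is read as one forward Euler step of an ODE, and a cheap coarse sweep stitches together block evaluations that are independent of each other.

```python
import math

def layer(x, w, h):
    """One residual layer, read as a forward Euler step of dx/dt = tanh(w * x)."""
    return x + h * math.tanh(w * x)

def fine_block(x, ws, h):
    """Apply every layer of a block in order (the exact block propagator)."""
    for w in ws:
        x = layer(x, w, h)
    return x

def coarse_block(x, ws, h):
    """Cheap stand-in for a block: a single layer with a len(ws)-times larger step."""
    return layer(x, ws[0], h * len(ws))

def serial_forward(x0, weights, h):
    """Ordinary layer-by-layer forward pass."""
    return fine_block(x0, weights, h)

def two_level_forward(x0, weights, h, block_size, iters):
    """Parareal-style forward pass; returns the approximate final state."""
    blocks = [weights[i:i + block_size] for i in range(0, len(weights), block_size)]
    # Initial guess for the state at each block boundary: one serial coarse sweep.
    u = [x0]
    for ws in blocks:
        u.append(coarse_block(u[-1], ws, h))
    for _ in range(iters):
        # The per-block evaluations below do not depend on each other,
        # so this is the part that could be spread across devices.
        fine = [fine_block(u[n], ws, h) for n, ws in enumerate(blocks)]
        coarse_old = [coarse_block(u[n], ws, h) for n, ws in enumerate(blocks)]
        # Serial but cheap correction sweep over block boundaries.
        new = [x0]
        for n, ws in enumerate(blocks):
            new.append(coarse_block(new[n], ws, h) + fine[n] - coarse_old[n])
        u = new
    return u[-1]

# Hypothetical toy setup: 16 layers with made-up weights, split into 4 blocks.
weights = [0.5 + 0.1 * math.sin(i) for i in range(16)]
h = 0.1
exact = serial_forward(1.0, weights, h)
for k in (0, 1, 2, 4):
    approx = two_level_forward(1.0, weights, h, block_size=4, iters=k)
    print(k, abs(approx - exact))
```

Stopping after fewer iterations than there are blocks leaves an error in the output. Inexactness of this kind is what the abstract describes as biasing the gradients. Running as many iterations as there are blocks reproduces the serial result up to rounding, at which point nothing has been gained over serial execution.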
Peer Reviews
Decision · Submitted to ICLR 2026
- The paper addresses a core scalability bottleneck in transformer models, layerwise serialization, by proposing a layer-parallel strategy that is compatible with large-scale distributed training.
- The methodology includes an insightful and actionable bias/gradient control mechanism, allowing adaptive switching between inexact (layer-parallel) and exact (serial) computation as training approaches convergence, a pragmatic solution to bias-induced optimization pitfalls.
- Experiments are substan

- **Insufficient Baseline Coverage in Scaling Experiments**: In the scaling studies (Figures 6–8), while classic strong scaling is shown for different architectures, there is no direct comparison to pipeline or tensor parallelism baselines under identical hardware and problem settings. It is not clear how much of the observed speedup is unique to the proposed method, as opposed to achievable by state-of-practice alternatives.
- **Ambiguity in Practical Integration**: The presentation lacks subst

The idea is interesting. Although the methods come from numerical ODEs and are not new, applying them to training transformers is conceptually interesting and orthogonal to other parallel strategies.

- Unfair comparison baseline: The serial baseline is single-GPU sequential training, while the proposed method uses multi-GPU resources. Instead, standard multi-GPU baselines like TP or PP should be compared.
- Limited layer-parallel training horizon: Once switched to serial mode, the algorithm never re-enables MGRIT. Moreover, the horizon where MGRIT is enabled is very limited. In the GPT-2 pretraining experiment, MGRIT is disabled after 1,000 batches, corresponding to 0.25B tokens (batch size

- The paper is well structured and easy to follow.
- The method is evaluated across multiple model architectures and hyperparameter settings.
- Monitoring residuals effectively identifies the transition point between parallel and serial training.

- The main contribution of the paper is the application of an existing layer-parallel training method to encoder–decoder architectures and the introduction of a monitoring mechanism for training transition. However, applying the layer-parallel approach to encoder–decoder architectures does not appear to differ significantly from its application to encoder-only or decoder-only architectures, making the contribution rather limited.
- In Figure 2, when training at level-1, Layers 1, 3, 5, and 7 are
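
The one-way switch from layer-parallel to serial training that the reviews discuss could be sketched as below. This is a guess at the general shape only. The excerpts do not give the paper's actual criterion, so the monitored `indicator` (for example, some residual of the layer-parallel solve), the `threshold`, the `patience` count, and the class name `SwitchController` are all hypothetical. The direction of the test (switching when the indicator exceeds the threshold) is also an assumption.

```python
from dataclasses import dataclass

@dataclass
class SwitchController:
    """Hypothetical one-way controller: layer-parallel first, serial afterwards."""
    threshold: float              # assumed tolerance on the monitored indicator
    patience: int = 3             # assumed number of consecutive violations
    mode: str = "layer_parallel"
    _strikes: int = 0

    def update(self, indicator: float) -> str:
        """Record one monitored value and return the mode for the next batch."""
        if self.mode == "serial":
            return self.mode      # never switches back, as one review notes
        if indicator > self.threshold:
            self._strikes += 1
        else:
            self._strikes = 0     # a good batch resets the count
        if self._strikes >= self.patience:
            self.mode = "serial"
        return self.mode

# Usage with made-up indicator values.
controller = SwitchController(threshold=0.1, patience=2)
modes = [controller.update(v) for v in [0.05, 0.2, 0.05, 0.2, 0.3, 0.01]]
print(modes)
```

The abstract also mentions the alternative of systematically increasing the accuracy of layer-parallel training. In a sketch like this, that would mean raising an iteration count instead of setting `mode` to `"serial"`.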
Taxonomy
Topics: Neural Networks and Reservoir Computing · Model Reduction and Neural Networks · Neural Networks and Applications
