Layer-Parallel Training for Transformers

Shuai Jiang, Marc Salvadó-Benasco, Eric C. Cyr, Alena Kopaničáková, Rolf Krause, and Jacob B. Schroder

arXiv:2601.09026·cs.LG·January 27, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a layer-parallel training method for transformers using a neural ODE formulation, enabling scalable parallel training for large models while maintaining accuracy.

Contribution

It develops a multilevel parallel-in-time algorithm for transformer training, addressing bias issues and demonstrating effective acceleration on large models.

Findings

- Achieves parallel acceleration over the layer dimension.
- Maintains accuracy comparable to serial training.
- Demonstrated on BERT, GPT2, ViT, and machine translation architectures.

Abstract

We present a new training methodology for transformers using a multilevel, layer-parallel approach. Through a neural ODE formulation of transformers, our application of a multilevel parallel-in-time algorithm for the forward and backpropagation phases of training achieves parallel acceleration over the layer dimension. This dramatically enhances parallel scalability as the network depth increases, which is particularly useful for increasingly large foundational models. However, achieving this introduces errors that cause systematic bias in the gradients, which in turn reduces convergence when closer to the minima. We develop an algorithm to detect this critical transition and either switch to serial training or systematically increase the accuracy of layer-parallel training. Results, including BERT, GPT2, ViT, and machine translation architectures, demonstrate parallel-acceleration as…
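
To make the neural ODE view concrete, the sketch below treats each residual layer as one forward-Euler step and then recovers the serial forward pass with a two-level, parareal-style iteration. This is a simpler relative of the multilevel parallel-in-time algorithm the abstract describes, not the authors' implementation. The `tanh` layer, the coarse propagator (one large step using the first layer of each chunk), and all function names are assumptions made for illustration.

```python
import numpy as np

def layer(x, W, h):
    # One residual layer, read as a forward-Euler step of dx/dt = tanh(W x).
    return x + h * np.tanh(W @ x)

def serial_forward(x0, weights, h):
    # Standard layer-by-layer propagation (the serial baseline).
    x = x0
    for W in weights:
        x = layer(x, W, h)
    return x

def fine(x, chunk, h):
    # Fine propagator: apply every layer in the chunk with step h.
    for W in chunk:
        x = layer(x, W, h)
    return x

def coarse(x, chunk, h):
    # Coarse propagator (assumed): one big step with the chunk's first layer.
    return layer(x, chunk[0], h * len(chunk))

def two_level_forward(x0, weights, h, chunk_size, iters):
    chunks = [weights[i:i + chunk_size]
              for i in range(0, len(weights), chunk_size)]
    n = len(chunks)
    # Initial guess at the chunk boundaries from a cheap coarse sweep.
    U = [x0]
    for c in chunks:
        U.append(coarse(U[-1], c, h))
    for _ in range(iters):
        # These fine solves are independent of each other, so they could
        # run in parallel across devices; here they run in a plain loop.
        F_old = [fine(U[j], chunks[j], h) for j in range(n)]
        G_old = [coarse(U[j], chunks[j], h) for j in range(n)]
        # Serial coarse correction sweep.
        new = [x0]
        for j in range(n):
            new.append(coarse(new[j], chunks[j], h) + F_old[j] - G_old[j])
        U = new
    return U[-1]
```

With fewer iterations than chunks, the output is only an approximation of the serial result. That inexactness is the kind of error the abstract says can bias gradients near a minimum. Running as many iterations as there are chunks reproduces the serial forward pass up to rounding.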

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The paper addresses a core scalability bottleneck in transformer models, layerwise serialization, by proposing a layer-parallel strategy that is compatible with large-scale distributed training.
- The methodology includes an insightful and actionable bias/gradient control mechanism, allowing adaptive switching between inexact (layer-parallel) and exact (serial) computation as training approaches convergence, a pragmatic solution to bias-induced optimization pitfalls.
- Experiments are substan

Weaknesses

- **Insufficient Baseline Coverage in Scaling Experiments**: In the scaling studies (Figures 6–8), while classic strong scaling is shown for different architectures, there is no direct comparison to pipeline or tensor parallelism baselines under identical hardware and problem settings. It is not clear how much of the observed speedup is unique to the proposed method, as opposed to achievable by state-of-practice alternatives.
- **Ambiguity in Practical Integration**: The presentation lacks subst

Reviewer 02 · Rating 2 · Confidence 3

Strengths

The idea is interesting. Although the methods come from numerical ODEs and are not new, applying them to training transformers is conceptually interesting and orthogonal to other parallel strategies.

Weaknesses

- Unfair comparison baseline: The serial baseline is single-GPU sequential training, while the proposed method uses multi-GPU resources. Instead, standard multi-GPU baselines like TP or PP should be compared.
- Limited layer-parallel training horizon: Once switched to serial mode, the algorithm never re-enables MGRIT. Moreover, the horizon where MGRIT is enabled is very limited. In the GPT-2 pretraining experiment, MGRIT is disabled after 1,000 batches, corresponding to 0.25B tokens (batch size

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The paper is well structured and easy to follow.
- The method is evaluated across multiple model architectures and hyperparameter settings.
- Monitoring residuals effectively identifies the transition point between parallel and serial training.
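
The residual-monitoring idea in the last point can be sketched as a small one-way switch. This is a hypothetical illustration only: the paper's actual indicator and thresholds are not given in this excerpt, so `SwitchMonitor`, `threshold`, `patience`, and the meaning of `indicator` (for example, a residual norm of the inexact layer-parallel solve relative to the gradient norm) are all assumed names and choices.

```python
from dataclasses import dataclass

@dataclass
class SwitchMonitor:
    threshold: float = 0.5  # hypothetical tolerance on the indicator
    patience: int = 3       # consecutive violations required to switch
    strikes: int = 0        # current run of consecutive violations
    serial: bool = False    # True once training has moved to serial mode

    def update(self, indicator: float) -> bool:
        """Record one indicator value; return True if serial mode is active."""
        if self.serial:
            # One-way switch: layer-parallel mode is never re-enabled.
            return True
        if indicator > self.threshold:
            self.strikes += 1
        else:
            self.strikes = 0
        if self.strikes >= self.patience:
            self.serial = True
        return self.serial
```

In a training loop, `update` would be called once per batch (or every few batches). The `patience` counter is one way to keep a single noisy batch from triggering the transition.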

Weaknesses

- The main contribution of the paper is the application of an existing layer-parallel training method to encoder–decoder architectures and the introduction of a monitoring mechanism for training transition. However, applying the layer-parallel approach to encoder–decoder architectures does not appear to differ significantly from its application to encoder-only or decoder-only architectures, making the contribution rather limited.
- In Figure 2, when training at level-1, Layers 1, 3, 5, and 7 are

Taxonomy

Topics: Neural Networks and Reservoir Computing · Model Reduction and Neural Networks · Neural Networks and Applications