Stochastic Layer-wise Learning: Scalable and Efficient Alternative to Backpropagation

Bojian Yin; Federico Corradi

arXiv:2505.05181·cs.LG·October 1, 2025

TL;DR

This paper introduces Stochastic Layer-wise Learning (SLL), a scalable local training method that replaces backpropagation, maintains global coherence, and performs well on various neural network architectures and datasets.

Contribution

SLL is a novel layer-wise training algorithm that decomposes the global objective into local layer objectives using ELBO-inspired principles and stochastic regularization.

Findings

1. SLL outperforms recent local methods in experiments.
2. SLL matches backpropagation performance on multiple architectures.
3. Memory usage of SLL remains invariant with network depth.

Abstract

Backpropagation underpins modern deep learning, yet its reliance on global gradient synchronization limits scalability and incurs high memory costs. In contrast, fully local learning rules are more efficient but often struggle to maintain the cross-layer coordination needed for coherent global learning. Building on this tension, we introduce Stochastic Layer-wise Learning (SLL), a layer-wise training algorithm that decomposes the global objective into coordinated layer-local updates while preserving global representational coherence. The method is ELBO-inspired under a Markov assumption on the network, where the network-level objective decomposes into layer-wise terms and each layer optimizes a local objective via a deterministic encoder. The intractable KL in ELBO is replaced by a Bhattacharyya surrogate computed on auxiliary categorical posteriors obtained via fixed…
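The abstract's Bhattacharyya surrogate on auxiliary categorical posteriors can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes the categorical posteriors come from a softmax over a fixed random projection of each layer's activations (the abstract is truncated at this point, so the projection details are a guess), and the names `categorical_posterior`, `bhattacharyya_distance`, `R_prev`, `R_cur`, and the dimensions are hypothetical. The distance itself is the standard definition, $D_B(p, q) = -\ln \sum_i \sqrt{p_i q_i}$.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def categorical_posterior(h, R):
    # ASSUMPTION: project activations h (batch, d) with a fixed,
    # non-trainable matrix R (d, k), then normalize to a categorical.
    return softmax(h @ R)

def bhattacharyya_distance(p, q, eps=1e-12):
    # Standard definition: D_B = -ln(sum_i sqrt(p_i * q_i)).
    bc = np.sum(np.sqrt(p * q), axis=-1)
    # Clip so rounding error cannot yield a negative distance or log(0).
    return -np.log(np.clip(bc, eps, 1.0))

rng = np.random.default_rng(0)
d_prev, d_cur, k = 64, 32, 10  # hypothetical layer widths and category count

# Fixed random projections, drawn once and never updated (assumption).
R_prev = rng.standard_normal((d_prev, k)) / np.sqrt(d_prev)
R_cur = rng.standard_normal((d_cur, k)) / np.sqrt(d_cur)

# Stand-in activations for two adjacent layers on a batch of 8 inputs.
h_prev = rng.standard_normal((8, d_prev))
h_cur = rng.standard_normal((8, d_cur))

p = categorical_posterior(h_prev, R_prev)
q = categorical_posterior(h_cur, R_cur)
d_b = bhattacharyya_distance(p, q)  # one non-negative value per sample
```

Because both layers are mapped to the same small number of categories `k`, the divergence is computed in a low-dimensional space regardless of layer width, which is one plausible reading of the "low-dimensional summary" that Reviewer 02 describes.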

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 2

Strengths

1. The paper presents a conceptually interesting and theoretically motivated attempt to unify local learning and probabilistic inference.
2. The algorithmic formulation is elegant and modular, avoiding explicit backpropagation while maintaining representational coherence through stochastic projections.
3. Experiments demonstrate that SLL can achieve performance close to backpropagation across several architectures, confirming the feasibility of local probabilistic learning.

Weaknesses

1. The Markov factorization across layers and the replacement of the KL term with a Bhattacharyya surrogate are heuristic. The paper does not prove that optimizing these surrogates reliably improves the global ELBO or overall convergence.
2. The experiments focus on standard vision benchmarks such as MNIST, CIFAR, and ImageNette, which are relatively small and may not sufficiently test scalability or robustness. Larger-scale or non-vision domains would strengthen the claims.
3. The method’s op

Reviewer 02 · Rating 2 · Confidence 4

Strengths

The main strengths of this paper are:

- $S_1$: This paper addresses one of the important practical bottlenecks in backpropagation (BP): the memory cost. BP must store activations and the computational graph, which are especially heavy for ViTs and long sequences.
- $S_2$: This paper introduces an alternative to BP that is conceptually simple. The overall training technique decouples units and supervises them with a low-dimensional summary via a simple local divergence.
- $S_3$: The presented

Weaknesses

While the empirical results regarding memory seem promising, major weaknesses prevent me from recommending anything but reject for now. Some of those may be easily corrected by modifying the paper ($W_2$ for example).

- $W_1$: After reading the paper in detail (and the appendix), I am questioning its theoretical correctness:
  > The “average layerwise ELBO $\leq$ network ELBO” inequality relies on two strong, unstated assumptions in the main text (monotone predictive gain across depth under the

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper explores local learning from a probabilistic variational perspective, deriving a layer-wise learning objective based on the ELBO formulation. This theoretical contribution provides a fresh perspective on local learning.
2. The paper is well-organized with clear exposition and rigorous logical flow, making the technical content accessible to readers.
3. Beyond quantitative results, the paper provides additional visualizations including weight distributions and t-SNE-based representat

Weaknesses

1. Assumption 2 restricts the conditional dependence to only adjacent layer representations, leading to KL divergence-based local supervision that follows a first-order Markov assumption. This posterior estimation may neglect important long-range cross-layer information exchange, potentially limiting the model's expressive power.
2. The use of random projection for dimension reduction may result in constrained representation learning and raises concerns about scalability to large-scale trainin
