TL;DR
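This paper introduces Stochastic Layer-wise Learning (SLL), a scalable layer-wise training method that replaces backpropagation with coordinated layer-local updates, preserves global representational coherence, and performs close to backpropagation across several architectures on vision benchmarks such as MNIST, CIFAR, and ImageNette.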
Contribution
SLL is a novel layer-wise training algorithm that decomposes the global objective into local layer objectives using ELBO-inspired principles and stochastic regularization.
Findings
SLL outperforms recent local methods in experiments.
SLL matches backpropagation performance on multiple architectures.
Memory usage of SLL remains invariant with network depth.
Abstract
Backpropagation underpins modern deep learning, yet its reliance on global gradient synchronization limits scalability and incurs high memory costs. In contrast, fully local learning rules are more efficient but often struggle to maintain the cross-layer coordination needed for coherent global learning. Building on this tension, we introduce Stochastic Layer-wise Learning (SLL), a layer-wise training algorithm that decomposes the global objective into coordinated layer-local updates while preserving global representational coherence. The method is ELBO-inspired under a Markov assumption on the network, where the network-level objective decomposes into layer-wise terms and each layer optimizes a local objective via a deterministic encoder. The intractable KL in ELBO is replaced by a Bhattacharyya surrogate computed on auxiliary categorical posteriors obtained via fixed…
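To make the Bhattacharyya surrogate mentioned in the abstract concrete, here is a minimal sketch of the Bhattacharyya distance between two categorical distributions, $D_B(p, q) = -\ln \sum_k \sqrt{p_k q_k}$. This is an illustration, not the paper's implementation: the helper names (`fixed_random_projection`, `project`, `softmax`, `bhattacharyya_distance`), the Gaussian projection, the seed, and the dimensions are all assumptions chosen for the example, and the abstract is truncated before it specifies how the auxiliary posteriors are actually obtained.

```python
import math
import random


def softmax(logits):
    """Turn a list of real-valued logits into a categorical distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fixed_random_projection(in_dim, out_dim, seed=0):
    """Hypothetical fixed (non-trained) Gaussian projection matrix.

    Assumption for illustration only: the matrix is sampled once from a
    seeded generator and never updated.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(matrix, vec):
    """Matrix-vector product: maps a feature vector to out_dim logits."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def bhattacharyya_distance(p, q, eps=1e-12):
    """D_B(p, q) = -ln(sum_k sqrt(p_k * q_k)) for categorical p and q."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    # Clamp the coefficient into [eps, 1] to guard against log(0)
    # and against tiny floating-point overshoot above 1.
    bc = min(1.0, max(bc, eps))
    return -math.log(bc)


# Usage: compare the categorical summaries of two adjacent-layer features.
R = fixed_random_projection(in_dim=4, out_dim=3, seed=0)
h_prev = [0.2, -1.0, 0.5, 0.3]   # made-up activations of one layer
h_next = [0.1, -0.8, 0.7, 0.2]   # made-up activations of the next layer
p = softmax(project(R, h_prev))
q = softmax(project(R, h_next))
local_loss = bhattacharyya_distance(p, q)
```

Unlike the KL divergence, this quantity is symmetric in its arguments and, with the clamp above, stays finite when the two distributions have disjoint support, which is one plausible reason to prefer it as a surrogate.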
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper presents a conceptually interesting and theoretically motivated attempt to unify local learning and probabilistic inference.
2. The algorithmic formulation is elegant and modular, avoiding explicit backpropagation while maintaining representational coherence through stochastic projections.
3. Experiments demonstrate that SLL can achieve performance close to backpropagation across several architectures, confirming the feasibility of local probabilistic learning.
1. The Markov factorization across layers and the replacement of the KL term with a Bhattacharyya surrogate are heuristic. The paper does not prove that optimizing these surrogates reliably improves the global ELBO or overall convergence.
2. The experiments focus on standard vision benchmarks such as MNIST, CIFAR, and ImageNette, which are relatively small and may not sufficiently test scalability or robustness. Larger-scale or non-vision domains would strengthen the claims.
3. The method’s op
The main strengths of this paper are:
- $S_1$: This paper addresses one of the important practical bottlenecks in backpropagation (BP): the memory cost, as BP needs to store activation memory and the computational graph, which are especially heavy for ViTs and long sequences.
- $S_2$: This paper introduces an alternative to BP that is conceptually simple. The overall training technique decouples units and supervises with a low-dimensional summary via a simple local divergence.
- $S_3$: The presented
While the empirical results regarding memory seem promising, major weaknesses prevent me from recommending anything but reject for now. Some of those may be easily corrected by modifying the paper ($W_2$ for example).
- $W_1$: After reading the paper in detail (and the appendix), I am questioning its theoretical correctness:
> The “average layerwise ELBO $\leq$ network ELBO” inequality relies on two strong, unstated assumptions in the main text (monotone predictive gain across depth under the
1. The paper explores local learning from a probabilistic variational perspective, deriving a layer-wise learning objective based on the ELBO formulation. This theoretical contribution provides a fresh perspective on local learning.
2. The paper is well-organized with clear exposition and rigorous logical flow, making the technical content accessible to readers.
3. Beyond quantitative results, the paper provides additional visualizations including weight distributions and t-SNE-based representat
1. Assumption 2 restricts the conditional dependence to only adjacent layer representations, leading to KL divergence-based local supervision that follows a first-order Markov assumption. This posterior estimation may neglect important long-range cross-layer information exchange, potentially limiting the model's expressive power.
2. The use of random projection for dimension reduction may result in constrained representation learning and raises concerns about scalability to large-scale trainin
