Sparse Training of Neural Networks based on Multilevel Mirror Descent

Yannick Lunk; Sebastian J. Scott; Leon Bungert

arXiv:2602.03535·cs.LG·May 19, 2026

3 Reviews

TL;DR

This paper presents a novel dynamic sparse training algorithm for neural networks that combines Bregman iterations with multilevel optimization, achieving high sparsity, reduced computational cost, and maintained accuracy.

Contribution

It introduces a new sparse training method based on mirror descent and multilevel optimization, with proven convergence and significant efficiency improvements.

Findings

01

Achieves high sparsity with maintained accuracy on benchmarks.

02

Reduces the theoretical number of FLOPs relative to SGD training from 38% (standard Bregman iterations) to 6% (proposed method).

03

Cuts training time by about 50% with a sparsity-aware CPU implementation.

Abstract

We introduce a dynamic sparse training algorithm based on linearized Bregman iterations / mirror descent that exploits the naturally incurred sparsity by alternating between periods of static and dynamic sparsity pattern updates. The key idea is to combine sparsity-inducing Bregman iterations with adaptive freezing of the network structure to enable efficient exploration of the sparse parameter space while maintaining sparsity. We provide convergence guarantees by embedding our method in a multilevel optimization framework. Furthermore, we empirically show that our algorithm can produce highly sparse and accurate models on standard benchmarks. We also show that the theoretical number of FLOPs compared to SGD training can be reduced from 38% for standard Bregman iterations to 6% for our method while maintaining test accuracy. We additionally show a training time reduction by about 50%,…
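The alternation described in the abstract can be illustrated with a minimal sketch on a toy least-squares problem. This is not the authors' implementation: it assumes an elastic-net-type regularizer (so the mirror map reduces to soft-thresholding), exact gradients instead of mini-batches, and a fixed schedule of one dense step followed by `m` frozen-support steps. The names `shrink`, `loss`, `ml_linbreg_sketch`, and all parameter values are hypothetical.

```python
import numpy as np

def shrink(v, lam):
    """Soft-thresholding: zeroes entries with |v| <= lam."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def loss(A, b, theta):
    """Toy least-squares loss 1/(2n) * ||A theta - b||^2."""
    return 0.5 * np.mean((A @ theta - b) ** 2)

def ml_linbreg_sketch(A, b, lam=0.5, delta=1.0, m=4, cycles=200):
    """Alternate one dense linearized Bregman step with m frozen-support steps."""
    n_samples, n_features = A.shape
    # Assumed step size: 1 / (delta * L), with L the gradient Lipschitz constant.
    lipschitz = np.linalg.norm(A, 2) ** 2 / n_samples
    tau = 1.0 / (delta * lipschitz)

    v = np.zeros(n_features)          # dual (subgradient) variable
    theta = delta * shrink(v, lam)    # primal parameters, start fully sparse
    dense_steps, total_steps = 0, 0

    for _ in range(cycles):
        # Dense step: gradient w.r.t. all parameters, support may change.
        grad = A.T @ (A @ theta - b) / n_samples
        v -= tau * grad
        theta = delta * shrink(v, lam)
        dense_steps += 1
        total_steps += 1

        # Frozen steps: only currently nonzero parameters are updated,
        # so only the columns of A on the active set are touched.
        active = np.flatnonzero(theta)
        A_act = A[:, active]
        for _ in range(m):
            grad_act = A_act.T @ (A_act @ theta[active] - b) / n_samples
            v[active] -= tau * grad_act
            theta[active] = delta * shrink(v[active], lam)
            total_steps += 1

    return theta, dense_steps, total_steps

# Toy data: 5 of 50 ground-truth coefficients are nonzero.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
theta_true = np.zeros(50)
theta_true[:5] = 3.0
b = A @ theta_true

theta, dense_steps, total_steps = ml_linbreg_sketch(A, b)
print("nonzero parameters:", np.count_nonzero(theta))
print("fraction of dense-gradient steps:", dense_steps / total_steps)
```

In this sketch the dense gradient is computed in only 1 of every `m + 1` steps, which is where the savings come from; the frozen steps cost work proportional to the size of the active set. How the paper chooses the freezing schedule and handles stochastic gradients is not covered here.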

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. Offers a convergence proof adapting results from ML-BPGD and relative smoothness theory.
2. Empirical results on CIFAR-10 and TinyImageNet (ResNet-18, VGG-16, WideResNet-28-10) consistently support claims of sparsity and efficiency.

Weaknesses

1. Limited to small-scale image datasets; lacks evaluation on large-scale datasets or transformer architectures to demonstrate generality.
2. The FLOP analysis is theoretical; experimental measurements would strengthen it.
3. Missing recent dynamic sparse training baselines and structured-pruning comparisons.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1) The core idea of alternating between a "fix" sparse-training phase (the m coarse steps) and an "explore" phase (the 1 fine step) is simple, elegant, and well-motivated.
2) The paper's primary selling point is the massive theoretical reduction in FLOPs as shown in Appendix B. The authors demonstrate the source of the savings: standard LinBreg must compute a dense gradient at every step, while ML LinBreg only does so 1 out of every m+1 steps.
3) The proposed ML LinBreg consistently outperforms

Weaknesses

1) I think the paper could be improved by explaining the mismatch between the theoretical analysis and the practical implementation. The convergence guarantee (Theorem 1) is provided for the deterministic (exact gradient) setting. However, the algorithm is implemented using mini-batch "unbiased estimators" (Algorithm 1), which is a stochastic setting.
2) The headline-grabbing FLOPs reduction (to 6%) is entirely theoretical. This calculation assumes that unstructured sparsity (e.g., from $l_1$

Reviewer 03 · Rating 6 · Confidence 4

Strengths

+ Clear optimizer view that unifies sparse training with mirror descent and Bregman iterations, including explicit proximal updates and a principled route to sparsity.
+ Multilevel design that freezes structure and updates only active parameters, with formal handling of restriction and prolongation and a stated convergence result.
+ Theoretical FLOP reductions relative to standard Bregman iterations are articulated and support the motivation for the freezing schedule.

Weaknesses

- Convergence relies on relative smoothness and a PL-type inequality and is stated for exact gradients, so the guarantees do not directly cover the stochastic setting used in practice.
- The paper discusses potential computational savings but does not provide wall clock training speedups, energy, or memory profiles on hardware.
- The current evaluation focuses only on small-image vision classification tasks, which limits the generality of the findings. Such experiments may not capture the behavior

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Domain Adaptation and Few-Shot Learning