Adaptive Width Neural Networks
Federico Errica, Henrik Christiansen, Viktor Zaverkin, Mathias Niepert, Francesco Alesiani

TL;DR
This paper presents a novel method for neural networks to learn their layer widths during training, enabling adaptive complexity and efficient resource management across diverse data types.
Contribution
It introduces a backpropagation-based technique for jointly learning layer widths and parameters, reducing reliance on manual tuning and hyperparameter search.
Findings
Width adapts to task difficulty across data domains
Enables easy truncation and compression of trained networks
Reduces hyperparameter tuning in large-scale models
Abstract
For almost 70 years, researchers have typically selected the width of neural networks' layers either manually or through automated hyperparameter tuning methods such as grid search and, more recently, neural architecture search. This paper challenges the status quo by introducing an easy-to-use technique to learn an unbounded width of a neural network's layer during training. The method jointly optimizes the width and the parameters of each layer via standard backpropagation. We apply the technique to a broad range of data domains such as tables, images, text, sequences, and graphs, showing how the width adapts to the task's difficulty. A byproduct of our width learning approach is the easy truncation of the trained network at virtually zero cost, achieving a smooth trade-off between performance and compute resources. Alternatively, one can dynamically compress the network until…
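To make the truncation idea concrete, here is a minimal sketch, not the authors' implementation. It assumes, based on the reviews quoted further down this page, that each hidden unit's activation is multiplied by an importance weight that decreases with the unit's index and follows a discretized exponential distribution. The exact formula, the ReLU activation, the quantile-based cutoff, and all function names (`unit_importance`, `width_for_quantile`, `forward`) are illustrative assumptions.

```python
import math
from typing import List, Optional


def unit_importance(rate: float, n_units: int) -> List[float]:
    """Assumed importance of units 1..n_units: the probability mass that an
    Exponential(rate) variable places on the interval [i-1, i)."""
    return [math.exp(-rate * (i - 1)) - math.exp(-rate * i)
            for i in range(1, n_units + 1)]


def width_for_quantile(rate: float, quantile: float) -> int:
    """Smallest width k whose units cover `quantile` of the total mass,
    i.e. the smallest k with 1 - exp(-rate * k) >= quantile."""
    return max(1, math.ceil(-math.log(1.0 - quantile) / rate))


def forward(x: List[float], weights: List[List[float]], biases: List[float],
            rate: float, width: Optional[int] = None) -> List[float]:
    """One hidden layer whose i-th ReLU activation is scaled by its importance.
    Passing `width` keeps only the first `width` units (truncation)."""
    n = len(weights) if width is None else min(width, len(weights))
    importance = unit_importance(rate, n)
    out = []
    for i in range(n):
        pre = sum(w * xi for w, xi in zip(weights[i], x)) + biases[i]
        out.append(importance[i] * max(0.0, pre))
    return out


if __name__ == "__main__":
    rate = 0.5  # hypothetical learned rate; a larger rate means a narrower layer
    W = [[0.1 * (i + 1), -0.2] for i in range(8)]
    b = [0.05] * 8
    k = width_for_quantile(rate, 0.9)
    print("kept units:", k)
    print("truncated output:", forward([1.0, 0.5], W, b, rate, width=k))
```

Because importance depends only on a unit's index, truncation in this sketch amounts to slicing off the trailing units: the outputs of the kept units are unchanged, which is one way to read the "virtually zero cost" claim in the abstract.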
Peer Reviews
Decision·ICLR 2026 Poster
# Clarity
The paper is well-written.
# Significance
This paper proposes a Bayesian formulation of the problem of width selection when training a neural network, which is a well-grounded way of performing pruning and adding neurons (layer-wise). Additionally, the fact that, in theory, infinite-width neural networks are trainable (up to a decreasing scaling factor $f_l(i)$), makes this method fit several well-known theoretical frameworks (e.g., Neural Tangent Kernels).
# Novelty
There are severa
I do not see major weaknesses in this paper. But, on the technical side, several aspects deserve a discussion: 1. Why choose a discretized exponential distribution for $f_l$? Is there a theoretical/heuristic argument? One could choose, as in [2], slowly decreasing scalings, such as $x \mapsto 1/x$ or $x \mapsto 1/(\sqrt{x} \ln(x))$... 2. According to Eqn. (7), $\lambda_l$ could be either positive or negative, since it is a Gaussian random variable. This is not acceptable, provided Eqn. (6), w
1. The main advantage of the paper is that the latent variable which controls the width is learnt, so the addition and deletion of neurons is fast and requires significantly less compute than existing Hessian-based methods. 2. The training is stable at appropriate batch sizes, since the importance is low for large widths, and usually we start with a considerably large width for large models. 3. The strict ordering based on index ensures that old neurons are preserved, which again improves the s
1. It is harder for the neural network to change trajectory, since the importance is assigned based on the index: if an old neuron has to be removed, all newer neurons must be pruned as well, which limits the neural network from moving away from features it learnt. 2. More redundant copies: the importance acts as a weight on the activation, so when the strength of an important neuron might not be enough, the neural network may choose to learn redundant copies to strengthen the importance of f
1. Strong Motivation and Practical Impact: The authors address the challenge of hyperparameter tuning, a particularly arduous and resource-intensive task for large-scale networks. Automating the search for the width hyperparameter is therefore a highly practical contribution, especially in modern architectures with billions of parameters. 2. Efficient Post-Training Trade-Off: The proposed method offers the compelling ability to control model complexity after training with zero additional cost
1. Limited Demonstration of Practicality and Scalability: Although the paper emphasizes practical implications, the experimental results do not fully demonstrate the method’s practical utility or scalability. Specifically, the approach is only applied to multi-layer perceptron (MLP) layers across various models; for convolutional neural networks (CNNs), it is restricted to the final MLP classifier layer. Given the paper’s claim of general applicability, it is unclear why the adaptive width mech
Taxonomy
Topics: Neural Networks and Applications · Advanced Data Compression Techniques · Image Retrieval and Classification Techniques
