Expressivity of Neural Networks with Random Weights and Learned Biases

Ezekiel Williams; Alexandre Payeur; Avery Hee-Woon Ryoo; Thomas Jiralerspong; Matthew G. Perich; Luca Mazzucato; Guillaume Lajoie

arXiv:2407.00957·cs.NE·March 25, 2025

3 Reviews

TL;DR

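This paper shows, with theoretical and numerical evidence, that networks with fixed random weights remain universal approximators when only their biases are learned: feedforward networks can approximate any continuous function on a compact set, and recurrent networks can approximate dynamical systems.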

Contribution

It provides the first theoretical and numerical evidence that biases alone can enable universal approximation in neural networks with fixed random weights.

Findings

01. Fixed random weights do not hinder universal approximation.

02. Bias-only training can approximate continuous functions.

03. Recurrent networks with fixed weights can model dynamical systems.
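
One way to write the feedforward finding down is sketched here. The notation (W, A, b, φ, g, K, ε, δ, n) is assumed for illustration and is not taken from the paper, and the "with high probability over the random weights" phrasing is an assumption about the form of the guarantee; the paper itself gives the exact hypotheses on the activation and the weight distribution.

```latex
% Assumed notation: W and A are drawn at random and then frozen;
% only the bias vector b is learned.
f_b(x) = A\,\varphi(Wx + b), \qquad
W \in \mathbb{R}^{n \times d},\quad A \in \mathbb{R}^{k \times n},\quad b \in \mathbb{R}^{n}

% Bias-only approximation of a continuous target g on a compact set K:
\forall\, \varepsilon > 0,\ \delta \in (0,1):\quad
\Pr_{W,A}\!\Big[\, \exists\, b \in \mathbb{R}^{n} :\
\sup_{x \in K} \lVert f_b(x) - g(x) \rVert < \varepsilon \,\Big] \ge 1 - \delta
\quad \text{for sufficiently large width } n.
```

Read this way, the randomness sits in the frozen weights, and the learned biases select which of the many random hidden units contribute to the output.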

Abstract

Landmark universal function approximation results for neural networks with trained weights and biases provided the impetus for the ubiquitous use of neural networks as learning models in neuroscience and Artificial Intelligence (AI). Recent work has extended these results to networks in which a smaller subset of weights (e.g., output weights) are tuned, leaving other parameters random. However, it remains an open question whether universal approximation holds when only biases are learned, despite evidence from neuroscience and AI that biases significantly shape neural responses. The current paper answers this question. We provide theoretical and numerical evidence demonstrating that feedforward neural networks with fixed random weights can approximate any continuous function on compact sets. We further show an analogous result for the approximation of dynamical systems with recurrent…
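
The bias-only setting described in the abstract can be illustrated with a small numerical sketch. Everything specific below is an assumption made for illustration and not the authors' experimental setup: the tanh activation, the hidden width of 32, the uniform weight distributions, the target sin(πx), the learning rate, and the use of plain gradient descent. The input and output weights are drawn once and never updated; only the hidden biases are trained.

```python
import math
import random


def forward(x, w, b, a):
    """Single-hidden-layer network: sum_j a[j] * tanh(w[j] * x + b[j])."""
    return sum(aj * math.tanh(wj * x + bj) for wj, bj, aj in zip(w, b, a))


def mse(xs, ys, w, b, a):
    """Mean squared error of the network over the sample set."""
    return sum((forward(x, w, b, a) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train_biases(xs, ys, w, b, a, lr=0.2, steps=300):
    """Gradient descent on the hidden biases only; w and a are never updated."""
    b = list(b)  # work on a copy so the caller's initial biases are preserved
    for _ in range(steps):
        grad = [0.0] * len(b)
        for x, y in zip(xs, ys):
            hidden = [math.tanh(wj * x + bj) for wj, bj in zip(w, b)]
            err = sum(aj * hj for aj, hj in zip(a, hidden)) - y
            for j, (aj, hj) in enumerate(zip(a, hidden)):
                # d/db_j of err**2 is 2 * err * a_j * (1 - tanh(.)**2)
                grad[j] += 2.0 * err * aj * (1.0 - hj * hj) / len(xs)
        b = [bj - lr * gj for bj, gj in zip(b, grad)]
    return b


rng = random.Random(0)
n_hidden = 32  # hypothetical width, chosen only to keep the demo fast

# Fixed random weights (assumed distributions); these stay frozen throughout.
w = [rng.uniform(-3.0, 3.0) for _ in range(n_hidden)]
a = [rng.uniform(-1.0, 1.0) / math.sqrt(n_hidden) for _ in range(n_hidden)]

# Random initial biases; these are the only trainable parameters.
b0 = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]

# Hypothetical target: sin(pi * x) sampled on [-1, 1].
xs = [-1.0 + 2.0 * i / 39 for i in range(40)]
ys = [math.sin(math.pi * x) for x in xs]

w_before, a_before = list(w), list(a)
loss_before = mse(xs, ys, w, b0, a)
b_trained = train_biases(xs, ys, w, b0, a)
loss_after = mse(xs, ys, w, b_trained, a)
print(f"MSE before: {loss_before:.4f}, after bias-only training: {loss_after:.4f}")
```

The sketch only shows that adjusting biases alone can reduce the error of a network whose weights are frozen. It says nothing about how wide the network must be to reach a given accuracy, which is the scaling question raised in the reviews.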

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 6·Confidence 5

Strengths

- Overall, there is strength in its novelty of proving that bias learning in neural networks can have high expressivity that performs almost as well as a fully-trained network. This is significant because bias learning trains fewer parameters than a full network.
- Nature of bias learning is more behaviorally relevant in the context of tonic inputs, intrinsic cell parameters, threshold adaptation, and intrinsic excitability
- The theoretical proofs are very thorough, and backed up by numerical p

Weaknesses

- In response to bias learning having fewer parameters to learn, no data was shown on training time
- Little background was given on mask learning (the mask learning section was also super short - felt less developed relative to other parts of the paper). This is important because two of their highlights in the results relate to mask learning.
- Makes claims (i.e. lines 253-259, lines 418-420) that could have been easily backed up by data, but were not.
- Figure 1 color scheme is weird
- Pract

Reviewer 02·Rating 8·Confidence 3

Strengths

The authors connect well to the neuroscience and AI/ML literature and explain the proofs in an intuitive manner. The extension to RNNs and dynamical systems is also commendable as these often receive reduced attention in the ML community. The issue with the "gain" g in the weight distribution is well brought out.

Weaknesses

Section 3.3.2 on the Lorenz system is not clearly written: neither the architecture nor the external input to the network is clear. At first glance, the result seems to be a simple extension of the masking theorem of Malach et al. (2020). The difference from that proof should be made clear.

Reviewer 03·Rating 6·Confidence 4

Strengths

The main expressivity results shown are well-explained and seem mathematically tight. Considering that these results made use of a reduction to mask learning problems, the authors also do a good job discussing the relationship between their findings and those of the mask learning literature.

Weaknesses

A crucial aspect of this work with regard to its practical relevance is how large a bias-trained network needs to be to achieve similar performance to a fully trained network. Surely the scaling is better than the extreme network expansions constructed for the existence proofs, but how much better? The authors allude to performance as a function of trainable parameter count scaling similarly to fully trained networks, and thus only needing quadratic scaling in layer width, but they only evidenc
