When Are Bias-Free ReLU Networks Effectively Linear Networks?
Yedi Zhang, Andrew Saxe, Peter E. Latham

TL;DR
This paper shows that removing biases from ReLU networks limits their expressivity and that, under symmetry conditions on the data, two-layer bias-free ReLU networks have the same learning dynamics as linear networks.
Contribution
It demonstrates the limited expressivity of two-layer bias-free ReLU networks and establishes that, under symmetry conditions on the data, their learning dynamics match those of linear networks, which yields analytical time-course solutions for their learning behavior.
Findings
The only odd functions that two-layer bias-free ReLU networks can express are linear ones.
Under symmetry conditions on the data, these networks have the same learning dynamics as linear networks.
Deep bias-free ReLU networks are more expressive than two-layer ones but still share a number of similarities with deep linear networks.
Abstract
We investigate the implications of removing bias in ReLU networks regarding their expressivity and learning dynamics. We first show that two-layer bias-free ReLU networks have limited expressivity: the only odd function two-layer bias-free ReLU networks can express is a linear one. We then show that, under symmetry conditions on the data, these networks have the same learning dynamics as linear networks. This enables us to give analytical time-course solutions to certain two-layer bias-free (leaky) ReLU networks outside the lazy learning regime. While deep bias-free ReLU networks are more expressive than their two-layer counterparts, they still share a number of similarities with deep linear networks. These similarities enable us to leverage insights from linear networks to understand certain ReLU networks. Overall, our results show that some properties previously established for…
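The expressivity claim can be checked numerically. The sketch below is an illustration rather than code from the paper: the layer sizes, random weights, and helper names (`relu_net`, `odd_part`, `even_part`) are all assumptions. It uses the identities relu(z) - relu(-z) = z and relu(z) + relu(-z) = |z|, which imply that the odd part of h(x) = a^T relu(W x) is the linear map (1/2) a^T W x, and that all nonlinearity sits in the even part.

```python
import numpy as np

def relu_net(x, W, a):
    """Two-layer bias-free ReLU network: h(x) = a^T relu(W x)."""
    return a @ np.maximum(W @ x, 0.0)

# Illustrative sizes and random weights (assumptions, not from the paper).
rng = np.random.default_rng(0)
d, hidden = 5, 16
W = rng.standard_normal((hidden, d))
a = rng.standard_normal(hidden)

def odd_part(x):
    """Odd component of the network: (h(x) - h(-x)) / 2."""
    return 0.5 * (relu_net(x, W, a) - relu_net(-x, W, a))

def even_part(x):
    """Even component of the network: (h(x) + h(-x)) / 2."""
    return 0.5 * (relu_net(x, W, a) + relu_net(-x, W, a))

# Predicted closed forms from relu(z) - relu(-z) = z and relu(z) + relu(-z) = |z|.
linear_map = 0.5 * (a @ W)  # odd part should equal linear_map @ x

xs = rng.standard_normal((100, d))
max_odd_gap = max(abs(odd_part(x) - linear_map @ x) for x in xs)
max_even_gap = max(abs(even_part(x) - 0.5 * (a @ np.abs(W @ x))) for x in xs)

print("max |odd part - linear map|:", max_odd_gap)
print("max |even part - 0.5 * a^T |W x||:", max_even_gap)
```

Both gaps should sit at floating-point noise for any choice of weights. If the network represents an odd function, its even part vanishes and only the linear term remains, which is the statement in the abstract.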
Peer Reviews
Decision·Submitted to ICLR 2025
The paper is predominantly theoretical and full proofs are provided. The experimental parts complement the paper well and illustrate the theoretical findings. The presentation is very good and polished and the main claims in the paper are clearly presented. Understanding the expressiveness and training dynamics of different network architectures is important. The impact of removing bias from the architecture is also an interesting topic to study, in particular, as the authors point out, becaus
The main limitation of the results on training dynamics is that they are limited to the case that the target model is odd (in addition to some more mild assumptions). Specifically, it is shown that in this case, two-layer bias-free (leaky) ReLU networks essentially behave like a linear network. There is evidence that even slight violations of this property of the target model make the network behave in a non-linear way in later phases of training. I appreciate that it is challenging, but
The paper gives novel insights on both expressivity and optimization of bias-free networks. As these networks have been studied often in prior theoretical works, discussing their limitations sheds new light on the conclusions from past theoretical works. Additionally, while ReLU networks often are trained with bias in practice, in certain situations bias-free networks have been used in practice, and thus understanding their behavior and limitation is important. The insights on the dynamics of th
- Previous work by Basri et al. (cited by the authors) shows that two-layer bias-free networks cannot express non-linear odd functions when the inputs are uniformly distributed. The authors claim that the result in the paper is stronger, but it's not clear to me that this is the case. My understanding is that the authors show: for any non-linear odd function $f$, for any bias-free (leaky) ReLU network $h$, there exists some input $x$ such that $f(x) \neq h(x)$. My understanding is that Basri et
- Prior theoretical work considers networks with no bias, and thus it is an interesting question to understand the expressivity and learning dynamics of such networks. - The proofs appear to the best of my knowledge to be sound, and the paper is well-written and easy to follow.
- My main concern with the paper is that I find the contribution to be rather incremental, which limits the significance/impact of the work. For example, Theorem 7 requires both symmetry on the data and for the first layer to be initialized as rank 1 ($W_1 = W_2^Tr^T$). While the latter assumption is justified as a consequence of training from infinitesimal initialization, I still find these to be rather strong assumptions, and I do not think such equivalence between linear networks and ReLU net
Taxonomy
Topics: Advanced Memory and Neural Computing · Neural Networks and Applications · Network Security and Intrusion Detection
