Learning High-Degree Parities: The Crucial Role of the Initialization
Emmanuel Abbe, Elisabetta Cornacchia, Jan Hązła, Donald Kougang-Yombi

TL;DR
This paper investigates how the initial weight distribution affects the ability of gradient descent to learn high-degree parities, revealing that certain initializations enable learning of almost-full parities while others hinder it.
Contribution
It demonstrates that the learnability of high-degree parities depends critically on the initial weight distribution, with specific initializations enabling or preventing learning.
Findings
Discrete Rademacher initialization enables learning of almost-full parities.
Gaussian perturbation with large standard deviation prevents learning.
Learnability threshold depends on the standard deviation of initial weights.
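The two initializations contrasted in the findings above can be sketched as follows. This is a minimal illustration, not the paper's exact setup: it assumes "Gaussian perturbation" means adding independent zero-mean Gaussian noise of standard deviation `sigma` to each Rademacher weight, and the function names `rademacher_init` and `perturbed_init` are hypothetical. Any scaling or normalization used in the paper is omitted.

```python
import random

def rademacher_init(n, rng):
    """Draw n weights independently and uniformly from {-1, +1}."""
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def perturbed_init(n, sigma, rng):
    """Rademacher weights plus independent N(0, sigma^2) noise.

    Assumption: this additive form is one plausible reading of
    "Gaussian perturbation"; the paper's definition may differ.
    """
    base = rademacher_init(n, rng)
    return [w + rng.gauss(0.0, sigma) for w in base]

rng = random.Random(0)
discrete = rademacher_init(8, rng)        # every entry is exactly -1.0 or +1.0
noisy = perturbed_init(8, 1.0, rng)       # entries are no longer confined to {-1, +1}
```

With `sigma = 0` the perturbed initialization reduces to the discrete one, so `sigma` interpolates between the two regimes discussed in the findings.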
Abstract
Parities have become a standard benchmark for evaluating learning algorithms. Recent works show that regular neural networks trained by gradient descent can efficiently learn degree $k$ parities on uniform inputs for constant $k$, but fail to do so when $k$ and $d-k$ grow with $d$ (here $d$ is the ambient dimension). However, the case where $k=d-O_d(1)$ (almost-full parities), including the degree $d$ parity (the full parity), has remained unsettled. This paper shows that for gradient descent on regular neural networks, learnability depends on the initial weight distribution. On one hand, the discrete Rademacher initialization enables efficient learning of almost-full parities, while on the other hand, its Gaussian perturbation with large enough constant standard deviation $\sigma$ prevents it. The positive result for almost-full parities is shown to hold up to $\sigma=O(d^{-1})$,…
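For concreteness, the parity targets discussed in the abstract can be sketched in a few lines. This assumes inputs uniform on $\{-1,+1\}^d$ and a degree $k$ parity defined as the product of $k$ chosen coordinates; the helper names `parity` and `sample_uniform_input` are illustrative and not taken from the paper.

```python
import random

def parity(x, support):
    """Product of the +/-1 coordinates of x indexed by `support`."""
    out = 1
    for i in support:
        out *= x[i]
    return out

def sample_uniform_input(d, rng):
    """Draw x uniformly from {-1, +1}^d."""
    return [rng.choice((-1, 1)) for _ in range(d)]

d = 8
rng = random.Random(0)
x = sample_uniform_input(d, rng)

full = parity(x, range(d))             # degree d: the full parity
almost_full = parity(x, range(d - 1))  # degree d - 1: an almost-full parity
sparse = parity(x, (0, 1))             # degree 2: a constant-degree parity
```

The full parity and a degree $d-1$ parity differ by exactly one coordinate, so here `full == almost_full * x[d - 1]`.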
Peer Reviews
Decision·ICLR 2025 Poster
**Novelty**: While sparse parity problems ($k=O_d(1)$) are well studied in the context of representation learning with two-layer neural networks, I did not know that full parity can be learned with polynomial complexity. Thinking analogously to the sparse parity problems, we would need to train the first layer matrix but the gradient would vanish as the order of the parity gets larger. However, for the full parity, if the Rademacher initialization is used, the gradient of the second layer is exac
**Motivation**: Although this paper trains two-layer neural networks, the mechanism of learning is significantly different from sparse parity. In the sparse parity problems, people mainly discuss how the first layer weights align with the meaningful subspace. On the other hand, every direction is equivalent in this full parity setting, thus I am not sure whether this paper is motivated as a feature learning paper. Thus the paper should explain why solving full parity how neural network i
See summary. The paper has a nice negative result for correlation loss SGD trained network with poor alignment initialization.
My main criticism of this paper is two-fold. 1. The paper, throughout the beginning (including the abstract), gives the impression that they show the success of SGD for even almost full parity $k=d-O(1)$, but the positive result is only shown for the full parity case $k=d$. For example, the abstract lines (14-17). Later also in Lines (65-77). I found this very confusing when I got to the main formal results. (See also my Question 1) 2. To me the real contribution is the negative result. Particu
The manuscript is reasonably well-written. The primary contribution—demonstrating the gap between Rademacher and Gaussian initialization—is an interesting result, particularly in the context of how initialization impacts neural network learning.
* I would be careful when claiming negative results as they depend on additional noise injected to SGD, i.e., the $Z^t$ terms in Eq. (3), which seems unnatural compared to other theoretical results related to ReLU networks. Is there hope of extending the negative results to SGD without additional noise injection?
* The positive result in the paper notably involves training only the output layer, meaning there is no feature learning. However, it is unclear whether the authors consider training input
