Learning High-Degree Parities: The Crucial Role of the Initialization

Emmanuel Abbe; Elisabetta Cornacchia; Jan Hązła; Donald Kougang-Yombi

arXiv:2412.04910 · cs.LG · March 6, 2025


TL;DR

This paper investigates how the initial weight distribution affects the ability of gradient descent to learn high-degree parities. It shows that the discrete Rademacher initialization enables efficient learning of almost-full parities, while a Gaussian perturbation of it with large enough constant standard deviation prevents learning.

Contribution

The paper demonstrates that, for gradient descent on regular neural networks, the learnability of high-degree parities depends critically on the initial weight distribution: the same architecture can succeed or fail depending on the initialization.

Findings

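1. The discrete Rademacher initialization enables efficient learning of almost-full parities.

2. A Gaussian perturbation of the Rademacher initialization with large enough constant standard deviation $\sigma$ prevents learning.

3. Whether learning succeeds depends on the standard deviation $\sigma$ of the perturbation; the positive result is shown to hold up to $\sigma = O(d^{-1})$.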

Abstract

Parities have become a standard benchmark for evaluating learning algorithms. Recent works show that regular neural networks trained by gradient descent can efficiently learn degree-$k$ parities on uniform inputs for constant $k$, but fail to do so when $k$ and $d - k$ grow with $d$ (here $d$ is the ambient dimension). However, the case where $k = d - O_d(1)$ (almost-full parities), including the degree-$d$ parity (the full parity), has remained unsettled. This paper shows that for gradient descent on regular neural networks, learnability depends on the initial weight distribution. On the one hand, the discrete Rademacher initialization enables efficient learning of almost-full parities; on the other hand, its Gaussian perturbation with large enough constant standard deviation $\sigma$ prevents it. The positive result for almost-full parities is shown to hold up to $\sigma = O(d^{-1})$,…
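To make the objects named in the abstract concrete, here is a minimal NumPy sketch of a parity on $\{-1,+1\}^d$, a Rademacher initialization, and its Gaussian perturbation with standard deviation $\sigma$. It also computes, by exact enumeration for small $d$, the correlation between a single neuron and a parity at initialization. The function names, the ReLU activation, the bias value, and the single-neuron setup are illustrative assumptions and are not taken from the paper or its repository.

```python
import itertools
import numpy as np


def parity(x, subset):
    """Parity chi_S(x) = prod_{i in S} x_i for inputs x in {-1, +1}^d."""
    x = np.asarray(x)
    return np.prod(x[..., list(subset)], axis=-1)


def rademacher_init(rng, n_neurons, d):
    """Weights drawn independently and uniformly from {-1, +1}."""
    return rng.choice(np.array([-1.0, 1.0]), size=(n_neurons, d))


def perturbed_init(rng, n_neurons, d, sigma):
    """Rademacher weights plus independent N(0, sigma^2) noise."""
    w = rademacher_init(rng, n_neurons, d)
    return w + sigma * rng.standard_normal((n_neurons, d))


def neuron_parity_correlation(w, bias, subset):
    """Exact E_x[ReLU(<w, x> + bias) * chi_S(x)] over the uniform hypercube.

    Enumerates all 2^d inputs, so it is only meant for small d.
    ReLU is an assumed activation chosen for illustration.
    """
    d = w.shape[0]
    cube = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))
    activations = np.maximum(cube @ w + bias, 0.0)
    return float(np.mean(activations * parity(cube, subset)))


rng = np.random.default_rng(0)
d = 8
full_parity = range(d)  # S = {0, ..., d-1}, i.e. k = d

w_rad = rademacher_init(rng, 1, d)[0]
w_pert = perturbed_init(rng, 1, d, sigma=1.0)[0]

print("Rademacher init:", neuron_parity_correlation(w_rad, 0.5, full_parity))
print("Perturbed init: ", neuron_parity_correlation(w_pert, 0.5, full_parity))
```

Passing `range(d - 1)` as `subset` probes an almost-full parity instead of the full one. The printed numbers only illustrate how such a comparison could be set up at small $d$; they are not a reproduction of the paper's results.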

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 3

Strengths

**Novelty**: While sparse parity problems ($k=O_d(1)$) are well studied in the context of representation learning with two-layer neural networks, I did not know that full parity can be learned with polynomial complexity. Thinking analogously to the sparse parity problems, we would need to train the first layer matrix but the gradient would vanish as the order of the parity gets larger. However, for the full parity, if the Rademacher initialization is used, the gradient of the second layer is exac

Weaknesses

**Motivation**: Although this paper trains two-layer neural networks, the mechanism of learning is significantly different from sparse parity. In the sparse parity problems, people mainly discuss how the first layer weights align with the meaningful subspace. On the other hand, every direction is equivalent in this full parity setting, so I am not sure whether this paper is motivated as a feature learning paper. Thus the paper should explain why solving full parity how neural network i

Reviewer 02 · Rating 6 · Confidence 3

Strengths

See summary. The paper has a nice negative result for correlation loss SGD trained network with poor alignment initialization.

Weaknesses

My main criticism of this paper is two-fold.

1. The paper, throughout the beginning (including the abstract), gives the impression that they show the success of SGD for even almost full parity $k=d-O(1)$, but the positive result is only shown for the full parity case $k=d$. For example, the abstract lines (14-17). Later also in Lines (65-77). I found this very confusing when I got to the main formal results. (See also my Question 1)
2. To me the real contribution is the negative result. Particu

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The manuscript is reasonably well-written. The primary contribution—demonstrating the gap between Rademacher and Gaussian initialization—is an interesting result, particularly in the context of how initialization impacts neural network learning.

Weaknesses

* I would be careful when claiming negative results as they depend on additional noise injected to SGD, i.e., the $Z^t$ terms in Eq. (3), which seems unnatural compared to other theoretical results related to ReLU networks. Is there a hope to extend the negative results to SGD without additional noise injection?
* The positive result in the paper notably involves training only the output layer, meaning there is no feature learning. However, it is unclear whether the authors consider training input

Code & Models

Repositories

ecornacchia/high-degree-parities
PyTorch · Official

Videos

Learning High-Degree Parities: The Crucial Role of the Initialization · SlidesLive

Taxonomy

Topics · Second Language Learning and Teaching