SGD Finds then Tunes Features in Two-Layer Neural Networks with near-Optimal Sample Complexity: A Case Study in the XOR problem
Margalit Glasgow

TL;DR
This paper demonstrates that stochastic gradient descent can efficiently learn the XOR function in a two-layer neural network with near-optimal sample complexity, revealing a two-phase feature learning process.
Contribution
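It gives what the authors describe as the first analysis showing that standard SGD on a standard two-layer network learns XOR on isotropic data with a number of samples nearly linear in the input dimension, and it highlights a two-phase evolution of features during training.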
Findings
SGD achieves near-optimal sample complexity for XOR learning.
Network evolves through signal-finding and signal-heavy phases.
Only a small fraction of neurons needs to learn features, because the simultaneous growth of their second-layer weights amplifies them.
Abstract
In this work, we consider the optimization process of minibatch stochastic gradient descent (SGD) on a 2-layer neural network with data separated by a quadratic ground truth function. We prove that with data drawn from the $d$-dimensional Boolean hypercube labeled by the quadratic "XOR" function $y = -x_i x_j$, it is possible to train to a population error $o(1)$ with $d \,\mathrm{polylog}(d)$ samples. Our result considers simultaneously training both layers of the two-layer neural network with ReLU activations via standard minibatch SGD on the logistic loss. To our knowledge, this work is the first to give a sample complexity of $\tilde{O}(d)$ for efficiently learning the XOR function on isotropic data on a standard neural network with standard training. Our main technique is showing that the network evolves in two phases: a phase where the network is small and…
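The setup described in the abstract can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not the paper's construction: the function names, the dimension, width, batch size, learning rate, initialization scale, and the choice of coordinates 0 and 1 for the XOR pair are all hypothetical, picked only so the example runs quickly. It samples inputs uniformly from the Boolean hypercube, labels them with the product rule y = -x_i * x_j, and runs minibatch SGD on the logistic loss while updating both layers of a two-layer ReLU network.

```python
import math
import random


def sample_xor_batch(rng, batch_size, d, i=0, j=1):
    """Draw x uniformly from {-1,+1}^d and label it y = -x_i * x_j."""
    xs = [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(batch_size)]
    ys = [-x[i] * x[j] for x in xs]
    return xs, ys


def forward(W, a, x):
    """Two-layer ReLU network: f(x) = sum_k a_k * relu(<w_k, x>)."""
    pre = [sum(wt * xt for wt, xt in zip(w, x)) for w in W]
    out = sum(ak * max(p, 0.0) for ak, p in zip(a, pre))
    return out, pre


def logistic_loss(margin):
    """log(1 + exp(-margin)), written to avoid overflow."""
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_step(W, a, xs, ys, lr):
    """One minibatch SGD step on the logistic loss, updating BOTH layers in place.

    Returns the average loss on the batch before the update.
    """
    m, d, n = len(W), len(W[0]), len(xs)
    grad_W = [[0.0] * d for _ in range(m)]
    grad_a = [0.0] * m
    total = 0.0
    for x, y in zip(xs, ys):
        out, pre = forward(W, a, x)
        total += logistic_loss(y * out)
        # Derivative of log(1 + exp(-y * out)) with respect to out.
        g = -y * sigmoid(-y * out)
        for k in range(m):
            if pre[k] > 0.0:  # ReLU passes gradient only where it is active
                grad_a[k] += g * pre[k]
                for t in range(d):
                    grad_W[k][t] += g * a[k] * x[t]
    for k in range(m):
        a[k] -= lr * grad_a[k] / n
        for t in range(d):
            W[k][t] -= lr * grad_W[k][t] / n
    return total / n


# Hypothetical toy settings (assumptions, not values from the paper).
rng = random.Random(0)
d, width = 8, 16
W = [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(width)]
a = [rng.choice((-0.1, 0.1)) for _ in range(width)]

losses = []
for step in range(100):
    xs, ys = sample_xor_batch(rng, 32, d)  # fresh samples each step
    losses.append(sgd_step(W, a, xs, ys, lr=0.1))
```

Each step draws a fresh minibatch, so the total number of samples used is the batch size times the number of steps. The sketch makes no attempt to reproduce the paper's sample-complexity guarantee or its two-phase dynamics; it only makes the data model, architecture, and update rule concrete.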
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Machine Learning and ELM
Methods: Stochastic Gradient Descent
