SGD Finds then Tunes Features in Two-Layer Neural Networks with near-Optimal Sample Complexity: A Case Study in the XOR problem

Margalit Glasgow

arXiv:2309.15111·cs.LG·October 3, 2023

TL;DR

This paper proves that minibatch stochastic gradient descent can train a two-layer ReLU network to learn the XOR function with near-optimal sample complexity ($d\,\mathrm{polylog}(d)$ samples), and that training proceeds through a two-phase feature learning process.

Contribution

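To the author's knowledge, it gives the first $\tilde{O}(d)$ sample complexity for efficiently learning XOR on isotropic data with a standard two-layer network and standard minibatch SGD, and it identifies a two-phase evolution of the features during training.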

Findings

01. SGD achieves near-optimal sample complexity for XOR learning.

02. Network evolves through signal-finding and signal-heavy phases.

03. Training only a small fraction of neurons suffices for feature amplification.

Abstract

In this work, we consider the optimization process of minibatch stochastic gradient descent (SGD) on a 2-layer neural network with data separated by a quadratic ground truth function. We prove that with data drawn from the $d$-dimensional Boolean hypercube labeled by the quadratic "XOR" function $y = -x_{i} x_{j}$, it is possible to train to a population error $o(1)$ with $d\,\mathrm{polylog}(d)$ samples. Our result considers simultaneously training both layers of the two-layer neural network with ReLU activations via standard minibatch SGD on the logistic loss. To our knowledge, this work is the first to give a sample complexity of $\tilde{O}(d)$ for efficiently learning the XOR function on isotropic data on a standard neural network with standard training. Our main technique is showing that the network evolves in two phases: a signal-finding phase where the network is small and…
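The setup the abstract describes can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the paper's algorithm as analyzed: the function names, the Gaussian initialization, and every hyperparameter (width, initialization scale, learning rate, batch size) are assumptions, since the abstract does not specify them. It shows the three ingredients that the abstract does state: inputs drawn uniformly from the Boolean hypercube with labels $y = -x_{i} x_{j}$, a two-layer ReLU network, and a minibatch SGD step on the logistic loss that updates both layers at once.

```python
import math
import random


def sample_xor_batch(rng, batch_size, d, i=0, j=1):
    """Draw x uniformly from {-1, +1}^d and label it y = -x_i * x_j."""
    xs = [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(batch_size)]
    ys = [-x[i] * x[j] for x in xs]
    return xs, ys


def init_network(rng, d, width, scale):
    """Small Gaussian init for both layers (an assumption, not the paper's scheme)."""
    W = [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(width)]
    a = [rng.gauss(0.0, scale) for _ in range(width)]
    return W, a


def forward(W, a, x):
    """f(x) = sum_k a_k * relu(<w_k, x>); also returns the pre-activations."""
    pre = [sum(wk * xk for wk, xk in zip(w, x)) for w in W]
    out = sum(ak * max(p, 0.0) for ak, p in zip(a, pre))
    return out, pre


def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_loss(margin):
    """log(1 + exp(-margin)), computed without overflow."""
    if margin > 0.0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_loss(W, a, xs, ys):
    """Average logistic loss of the network on a batch."""
    return sum(logistic_loss(y * forward(W, a, x)[0]) for x, y in zip(xs, ys)) / len(xs)


def sgd_step(W, a, xs, ys, lr):
    """One minibatch SGD step on the logistic loss, updating both layers in place."""
    width, d, n = len(W), len(W[0]), len(xs)
    grad_W = [[0.0] * d for _ in range(width)]
    grad_a = [0.0] * width
    for x, y in zip(xs, ys):
        out, pre = forward(W, a, x)
        g = -y * sigmoid(-y * out)  # derivative of log(1 + exp(-y f)) w.r.t. f
        for k in range(width):
            if pre[k] > 0.0:  # ReLU passes gradient only for active neurons
                grad_a[k] += g * pre[k] / n
                coeff = g * a[k] / n
                for m in range(d):
                    grad_W[k][m] += coeff * x[m]
    # Both layers move using gradients taken at the same (old) parameters.
    for k in range(width):
        a[k] -= lr * grad_a[k]
        for m in range(d):
            W[k][m] -= lr * grad_W[k][m]
```

To experiment, one could call `sample_xor_batch` for a fresh minibatch at every iteration and apply `sgd_step` repeatedly, tracking `mean_loss` on held-out samples. Nothing in this sketch reproduces the paper's guarantees; it only makes the objects in the abstract concrete.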

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Machine Learning and ELM

Methods: Stochastic Gradient Descent