Towards Understanding Learning in Neural Networks with Linear Teachers
Roei Sarussi, Alon Brutzkus, Amir Globerson

TL;DR
This paper proves that stochastic gradient descent globally optimizes the cross-entropy loss of a two-layer Leaky ReLU network on linearly separable data, and gives theoretical support for why the learned network is often approximately linear.
Contribution
It proves global optimization in this setting, a question the authors describe as previously unsolved, and shows that convergence of the weights to two clusters implies an approximately linear decision boundary.
Findings
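SGD globally optimizes the cross-entropy objective for a two-layer Leaky ReLU network on linearly separable data.
Empirically, the learned network often turns out to be approximately linear.
If the weights converge to two clusters, the decision boundary is approximately linear; a condition on the optimization leads to such clustering.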
Abstract
Can a neural network minimizing cross-entropy learn linearly separable data? Despite progress in the theory of deep learning, this question remains unsolved. Here we prove that SGD globally optimizes this learning problem for a two-layer network with Leaky ReLU activations. The learned network can in principle be very complex. However, empirical evidence suggests that it often turns out to be approximately linear. We provide theoretical support for this phenomenon by proving that if network weights converge to two weight clusters, this will imply an approximately linear decision boundary. Finally, we show a condition on the optimization that leads to weight clustering. We provide empirical results that validate our theoretical analysis.
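The setting in the abstract can be made concrete with a small experiment. The following is a minimal sketch, not the authors' code: it assumes a network of the form N(x) = v * sum_i s(w_i . x) - v * sum_i s(u_i . x) with a fixed second-layer weight v and Leaky ReLU s, trained by single-sample SGD on the logistic (binary cross-entropy) loss. The architecture details, the hyperparameters, and the helper names (`make_data`, `train_sgd`, `linear_agreement`) are all assumptions chosen for illustration.

```python
import numpy as np

def leaky_relu(z, alpha=0.1):
    return np.where(z > 0, z, alpha * z)

def network(X, W, U, v, alpha=0.1):
    # Assumed architecture: positive neurons W, negative neurons U, fixed output weight v.
    pos = leaky_relu(X @ W.T, alpha).sum(axis=1)
    neg = leaky_relu(X @ U.T, alpha).sum(axis=1)
    return v * (pos - neg)

def make_data(n=200, d=5, seed=0):
    # Labels come from a linear teacher w_star, so the data is linearly separable.
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal(d)
    w_star /= np.linalg.norm(w_star)
    X = rng.standard_normal((n, d))
    y = np.sign(X @ w_star)
    return X, y, w_star

def train_sgd(X, y, k=10, lr=0.1, epochs=50, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = 0.01 * rng.standard_normal((k, d))
    U = 0.01 * rng.standard_normal((k, d))
    v = 1.0 / np.sqrt(2 * k)  # hypothetical choice of the fixed second-layer scale
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            x, yi = X[i], y[i]
            margin = yi * network(x[None, :], W, U, v, alpha)[0]
            # Derivative of log(1 + exp(-y N)) with respect to N.
            g = -yi / (1.0 + np.exp(min(margin, 50.0)))
            slope_w = np.where(W @ x > 0, 1.0, alpha)[:, None]
            slope_u = np.where(U @ x > 0, 1.0, alpha)[:, None]
            W -= lr * g * v * slope_w * x[None, :]
            U += lr * g * v * slope_u * x[None, :]  # U enters N with a minus sign
    return W, U, v

def linear_agreement(X, W, U, v, alpha=0.1):
    # Fraction of points where the network agrees with the linear classifier
    # whose normal is (mean of W rows) - (mean of U rows).
    direction = W.mean(axis=0) - U.mean(axis=0)
    return float(np.mean(np.sign(network(X, W, U, v, alpha)) == np.sign(X @ direction)))

X, y, w_star = make_data()
W, U, v = train_sgd(X, y)
train_acc = float(np.mean(np.sign(network(X, W, U, v)) == y))
X_fresh = np.random.default_rng(123).standard_normal((1000, X.shape[1]))
agreement = linear_agreement(X_fresh, W, U, v)
print(f"training accuracy: {train_acc:.3f}")
print(f"agreement with linear classifier on fresh points: {agreement:.3f}")
```

Under the assumed architecture, the idealized two-cluster case is easy to reason about: if every row of W equals a vector w and every row of U equals a vector u, then N(x) is proportional to s(w . x) - s(u . x), and because Leaky ReLU with positive slope is strictly increasing, the sign of N(x) equals the sign of (w - u) . x. `linear_agreement` measures how close a trained network comes to that idealized behaviour.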
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning
Methods: Stochastic Gradient Descent
