Just How Flexible are Neural Networks in Practice?

Ravid Shwartz-Ziv; Micah Goldblum; Arpit Bansal; C. Bayan Bruss; Yann LeCun; Andrew Gordon Wilson

arXiv:2406.11463 · cs.LG · June 18, 2024 · 1 citation

Open Access · 3 Reviews

TL;DR

This paper investigates how flexible neural networks are in practice. It finds that standard training procedures let a network fit significantly fewer samples than its parameter count suggests, with implications for model efficiency and generalization.

Contribution

The paper provides empirical evidence on how the optimizer and the architecture influence the amount of data a neural network can fit in practice.

Findings

01 · Standard optimizers find minima that fit significantly fewer samples than the parameter count suggests

02 · Convolutional networks are more parameter-efficient than MLPs and ViTs, even on randomly labeled data

03 · SGD fits more data than full-batch gradient descent
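The findings above compare how much data a model fits against how many parameters it has. As a rough illustration of the parameter-count side, the sketch below counts weights for a single dense layer and a single convolutional layer. The layer sizes (a 32x32x3 input, 64 output units or channels, a 3x3 kernel) are hypothetical and not taken from the paper, and `samples_per_parameter` is only one assumed way to read "parameter-efficient", not the paper's definition.

```python
def dense_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Weights in a fully connected layer: one per (input, output) pair, plus biases."""
    return n_in * n_out + (n_out if bias else 0)


def conv2d_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Weights in a 2D convolution: one kernel x kernel filter per (c_in, c_out) pair, plus biases."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


def samples_per_parameter(samples_fit: int, n_params: int) -> float:
    """Assumed efficiency measure: number of samples fit divided by parameter count."""
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return samples_fit / n_params


# Hypothetical layer sizes, chosen only for illustration.
flattened_inputs = 32 * 32 * 3                # a 32x32 RGB image, flattened
dense = dense_params(flattened_inputs, 64)    # 3072 * 64 + 64
conv = conv2d_params(3, 64, kernel=3)         # 3 * 3 * 3 * 64 + 64
print(dense, conv)
```

With these sizes the dense layer has 196,672 parameters and the convolutional layer 1,792, because the convolution reuses the same small filters at every spatial position. Comparing architectures at a matched parameter count therefore means comparing very differently shaped networks.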

Abstract

It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters, underpinning notions of overparameterized and underparameterized models. In practice, however, we only find solutions accessible via our training procedure, including the optimizer and regularizers, limiting flexibility. Moreover, the exact parameterization of the function class, built into an architecture, shapes its loss surface and impacts the minima we find. In this work, we examine the ability of neural networks to fit data in practice. Our findings indicate that: (1) standard optimizers find minima where the model can only fit training sets with significantly fewer samples than it has parameters; (2) convolutional networks are more parameter-efficient than MLPs and ViTs, even on randomly labeled data; (3) while stochastic training is thought to have a…
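To make "the ability of neural networks to fit data in practice" concrete, the following is a minimal sketch of how one might search for the largest training-set size a training procedure can fit perfectly. This summary does not describe the paper's exact protocol, so everything here is an assumption: `can_fit` is a hypothetical callback that would train a fresh model on `n` samples and report whether it reached perfect training accuracy, and the search assumes that fitting is monotone in `n` (if `n` samples cannot be fit, neither can more).

```python
from typing import Callable


def largest_fittable_size(can_fit: Callable[[int], bool],
                          max_samples: int = 1_000_000) -> int:
    """Return the largest n <= max_samples with can_fit(n) True.

    Assumes can_fit is monotone: True up to some threshold, False after it.
    An empty training set (n = 0) is treated as trivially fit.
    """
    lo, hi = 0, 1
    # Phase 1: double the sample count until training fails or the cap is passed.
    while hi <= max_samples and can_fit(hi):
        lo, hi = hi, hi * 2
    if hi > max_samples:
        hi = max_samples + 1  # sizes beyond the cap are never tried
    # Phase 2: binary search between the last success (lo) and first failure (hi).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if can_fit(mid):
            lo = mid
        else:
            hi = mid
    return lo


def make_toy_can_fit(capacity: int) -> Callable[[int], bool]:
    """Hypothetical stand-in for real training: 'fits' any set of up to `capacity` samples."""
    return lambda n: n <= capacity


print(largest_fittable_size(make_toy_can_fit(37)))
```

In a real experiment `can_fit` would wrap a full training run, so each call is expensive; the doubling-then-bisecting search keeps the number of runs logarithmic in the answer. Comparing the returned size against the model's parameter count is what the abstract's first finding refers to.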

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

**Originality:** The paper's primary innovation lies in systematically quantifying the gap between theoretical and practical neural network capacity. While building on Nakkiran's EMC metric, it makes three notable advances: (1) demonstrating that SGD solutions enable fitting more samples than full-batch gradient descent, challenging the conventional wisdom about SGD's purely regularizing role, (2) showing that CNNs maintain parameter efficiency advantages even on random data, suggesting fundamen

Weaknesses

**Key Technical Limitations and Suggested Improvements:** 1. **Theoretical Foundation for SGD Findings** The paper's most striking result - that SGD enables fitting more samples than full-batch GD (Figure 3b) - lacks theoretical analysis. While empirically robust, understanding why this occurs is crucial (please let me know if I'm missing something). The authors should investigate whether this results from: - Loss landscape exploration properties (could analyze loss surface geometry using r

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The paper is written clearly and easy to understand. 2. The influence of architectures, optimizers, and activation functions on model capacity is interesting.

Weaknesses

1. The reasoning behind why SGD converges to solutions that fit fewer samples than parameter count is not clear. Authors should provide a step-by-step explanation of the mechanism by which SGD leads to solutions that fit fewer samples. It will be better to include a comparison with full-batch gradient descent to highlight the specific role of stochasticity in this phenomenon. 2. In Figure 1, CIFAR-10 CNN and CIFAR-10 MLP have EMC values approximately close to each other for higher values of par

Reviewer 03 · Rating 5 · Confidence 3

Strengths

The paper is well-written and easy to follow. The results involve lots of experimental observations. This topic may be an interesting direction.

Weaknesses

1. When investigating the relation between architectures and EMC, it is hard to compare the different architectures. The shape of the architecture may have a large impact on the ability of networks, so the paper needs to explain more about the comparison among different architectures. 2. It seems like the paper summarizes and explains the results obtained by experiments without the underlying reasons. For instance, the paper states that only ReLU improves the network's ability among all the acti

Taxonomy

Topics: Neural Networks and Applications

Methods: Sparse Evolutionary Training · Stochastic Gradient Descent