The Effect of SGD Batch Size on Autoencoder Learning: Sparsity, Sharpness, and Feature Learning

Nikhil Ghosh, Spencer Frei, Wooseok Ha, and Bin Yu

arXiv:2308.03215·stat.ML·August 8, 2023

Open Access

TL;DR

This paper studies how the SGD batch size affects training of a single-neuron autoencoder on orthogonal data. Any batch size strictly smaller than the number of samples yields a sparse, feature-selective global minimum, while full-batch gradient descent yields a dense global minimum that stays closely aligned with its initialization.

Contribution

The study analyzes SGD dynamics for a single-neuron autoencoder with linear or ReLU activation on orthogonal data, showing how the batch size determines the sparsity of the solution, the amount of feature learning, and the sharpness of the minimum found, supported by new convergence proof techniques.

Findings

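1. Any batch size strictly smaller than the number of samples leads SGD to a sparse global minimum that is nearly orthogonal to its initialization ("feature selection").

2. Full-batch gradient descent finds a dense global minimum that stays highly aligned with its initialization, so little feature learning occurs.

3. Minima found by full-batch gradient descent are flatter than those found with smaller batch sizes.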

Abstract

In this work, we investigate the dynamics of stochastic gradient descent (SGD) when training a single-neuron autoencoder with linear or ReLU activation on orthogonal data. We show that for this non-convex problem, randomly initialized SGD with a constant step size successfully finds a global minimum for any batch size choice. However, the particular global minimum found depends upon the batch size. In the full-batch setting, we show that the solution is dense (i.e., not sparse) and is highly aligned with its initialized direction, showing that relatively little feature learning occurs. On the other hand, for any batch size strictly smaller than the number of samples, SGD finds a global minimum which is sparse and nearly orthogonal to its initialization, showing that the randomness of stochastic gradients induces a qualitatively different type of "feature selection" in this setting.…
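The setting in the abstract can be explored with a small simulation. The following is a minimal sketch, not the authors' code, and it rests on several assumptions that do not come from the text above: a tied-weight parameterization x ↦ w(wᵀx) with linear activation, the standard basis vectors as the orthogonal data set (so the number of samples equals the dimension), a small Gaussian initialization, and an arbitrary step size and step count. The function names `loss`, `grad`, `run_sgd`, and `cosine` are hypothetical helpers introduced for illustration.

```python
import numpy as np


def loss(w, X):
    """Mean of 0.5 * ||x - w (w^T x)||^2 over the rows x of X (linear activation)."""
    s = X @ w                    # (n,) values w^T x_i
    R = X - np.outer(s, w)       # (n, d) residuals x_i - w (w^T x_i)
    return 0.5 * np.mean(np.sum(R ** 2, axis=1))


def grad(w, X):
    """Gradient of loss(w, X) with respect to w."""
    s = X @ w
    R = X - np.outer(s, w)
    # Per-sample gradient: -(s_i * r_i + (w^T r_i) * x_i)
    G = -(s[:, None] * R + (R @ w)[:, None] * X)
    return G.mean(axis=0)


def run_sgd(X, w0, batch_size, step_size, n_steps, seed=0):
    """Constant-step-size SGD; each batch is sampled without replacement."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    w = w0.copy()
    for _ in range(n_steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        w = w - step_size * grad(w, X[idx])
    return w


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Assumed toy setup: d orthonormal samples in d dimensions.
d = 10
X = np.eye(d)
w0 = 0.1 * np.random.default_rng(1).normal(size=d) / np.sqrt(d)

w_full = run_sgd(X, w0, batch_size=d, step_size=0.1, n_steps=5000)
w_small = run_sgd(X, w0, batch_size=1, step_size=0.1, n_steps=5000)

for name, w in [("full batch", w_full), ("batch size 1", w_small)]:
    print(name)
    print("  loss:", loss(w, X))
    print("  cosine with init:", cosine(w, w0))
    print("  entries with |w_i| > 0.01:", int(np.sum(np.abs(w) > 1e-2)))
```

In this toy setup the full-batch gradient is a scalar multiple of w, so full-batch gradient descent only rescales the initialization, which gives a concrete picture of the "dense and aligned with initialization" behavior. The printed diagnostics for batch size 1 can be compared against the sparsity claim; how quickly sparsity appears depends on the assumed step size and number of steps.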

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Model Reduction and Neural Networks · Machine Learning and ELM

Methods: Stochastic Gradient Descent