The Limitations of Large Width in Neural Networks: A Deep Gaussian Process Perspective
Geoff Pleiss, John P. Cunningham

TL;DR
This paper uses Deep Gaussian Processes to study how width affects neural networks. Increasing width makes the models behave more like ordinary (shallower) Gaussian processes, which can hurt performance, so there is an optimal width beyond which results degrade.
Contribution
The paper gives a theoretical and empirical analysis showing that large width can be detrimental, identifies a 'sweet spot' for width in Deep Gaussian Processes, and relates these findings to conventional neural networks.
Findings
Large width causes Deep GP models to converge to Gaussian processes.
In the Deep GP models studied, performance peaks at a width of about 1 or 2 units.
Increasing width beyond this optimum degrades performance.
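The first finding can be illustrated numerically with a minimal sketch. The setup below is an assumption made for illustration, not the paper's experimental code: a two-layer Deep GP prior whose hidden layer has `width` independent unit-variance GPs, and whose output layer uses an RBF kernel on the hidden features with the squared distance divided by `width`. The function name, the prior correlation `k1_x1x2`, and the sample count are hypothetical. Conditional on the hidden layer, the difference f(x1) - f(x2) is Gaussian with a random variance, so its excess kurtosis (zero for an exact Gaussian) measures how far the prior is from a GP.

```python
import numpy as np


def excess_kurtosis_of_difference(width, k1_x1x2=0.5, n_samples=200_000, seed=0):
    """Excess kurtosis of f(x1) - f(x2) under an assumed 2-layer Deep GP prior.

    Illustrative assumptions (not taken from the paper's code):
      - hidden layer: `width` i.i.d. zero-mean, unit-variance GPs with prior
        correlation `k1_x1x2` between the two inputs x1 and x2;
      - output layer: RBF kernel on the hidden features, with the squared
        distance between feature vectors divided by `width`.
    """
    rng = np.random.default_rng(seed)
    # Each hidden difference h_j(x1) - h_j(x2) is N(0, s2) with s2 = 2(1 - k1),
    # so the summed squared distance is s2 times a chi-square with `width` dof.
    s2 = 2.0 * (1.0 - k1_x1x2)
    sq_dist = s2 * rng.chisquare(df=width, size=n_samples) / width
    # Output-layer kernel value between the two hidden feature vectors.
    k2 = np.exp(-0.5 * sq_dist)
    # Given the hidden layer, f(x1) - f(x2) ~ N(0, v) with v = 2(1 - k2).
    v = 2.0 * (1.0 - k2)
    # For a Gaussian scale mixture, excess kurtosis = 3 * Var(v) / E[v]^2.
    return 3.0 * v.var() / v.mean() ** 2


if __name__ == "__main__":
    for w in (1, 4, 16, 64, 256):
        print(w, excess_kurtosis_of_difference(w))
```

Under these assumptions the printed excess kurtosis shrinks toward zero as `width` grows, because the normalized squared distance concentrates around its mean and the conditional variance becomes nearly deterministic. This only probes the prior at two inputs; it says nothing about test performance or the optimal width reported above.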
Abstract
Large width limits have been a recent focus of deep learning research: modulo computational practicalities, do wider networks outperform narrower ones? Answering this question has been challenging, as conventional networks gain representational power with width, potentially masking any negative effects. Our analysis in this paper decouples capacity and width via the generalization of neural networks to Deep Gaussian Processes (Deep GP), a class of nonparametric hierarchical models that subsume neural nets. In doing so, we aim to understand how width affects (standard) neural networks once they have sufficient capacity for a given modeling task. Our theoretical and empirical results on Deep GP suggest that large width can be detrimental to hierarchical models. Surprisingly, we prove that even nonparametric Deep GP converge to Gaussian processes, effectively becoming shallower without any…
Taxonomy
Topics: Gaussian Processes and Bayesian Inference · Statistical Mechanics and Entropy · Machine Learning and Data Classification
