The Effect of Model Size on Worst-Group Generalization

Alan Pham; Eunice Chan; Vikranth Srivatsa; Dhruba Ghosh; Yaoqing Yang; Yaodong Yu; Ruiqi Zhong; Joseph E. Gonzalez; Jacob Steinhardt

arXiv:2112.04094 · cs.LG · December 9, 2021

TL;DR

This study systematically examines how increasing model size affects worst-group generalization under empirical risk minimization (ERM), finding that larger pre-trained models improve worst-group performance on vision and NLP tasks even when subgroup labels are unavailable.

Contribution

The paper provides a systematic analysis of how model size affects worst-group generalization across multiple architectures, domains, and initialization methods.

Findings

1. Larger models do not harm and may improve worst-group test accuracy.
2. Pre-trained larger models consistently outperform smaller ones on Waterbirds and MultiNLI.
3. Increasing model size is recommended when subgroup labels are unavailable.

Abstract

Overparameterization is shown to result in poor test accuracy on rare subgroups under a variety of settings where subgroup information is known. To gain a more complete picture, we consider the case where subgroup information is unknown. We investigate the effect of model size on worst-group generalization under empirical risk minimization (ERM) across a wide range of settings, varying: 1) architectures (ResNet, VGG, or BERT), 2) domains (vision or natural language processing), 3) model size (width or depth), and 4) initialization (with pre-trained or random weights). Our systematic evaluation reveals that increasing model size does not hurt, and may help, worst-group test performance under ERM across all setups. In particular, increasing pre-trained model size consistently improves performance on Waterbirds and MultiNLI. We advise practitioners to use larger pre-trained models when…
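To make the evaluated quantity concrete, here is a minimal sketch of worst-group accuracy as it is commonly defined: the minimum, over subgroups, of per-group test accuracy. It is not taken from the paper's code. The function names, the group names, and the toy predictions are all hypothetical, chosen only to mirror a Waterbirds-style setup in which each example has a label (bird type) and a group (bird type combined with background).

```python
from collections import defaultdict


def per_group_accuracy(preds, labels, groups):
    """Return {group: accuracy} for every group id present in `groups`."""
    if not (len(preds) == len(labels) == len(groups)):
        raise ValueError("preds, labels, and groups must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        total[g] += 1
        correct[g] += int(p == y)
    return {g: correct[g] / total[g] for g in total}


def worst_group_accuracy(preds, labels, groups):
    """Return the lowest per-group accuracy (assumed definition: min over groups)."""
    accs = per_group_accuracy(preds, labels, groups)
    if not accs:
        raise ValueError("at least one example is required")
    return min(accs.values())


# Hypothetical toy data: label 0 = landbird, label 1 = waterbird.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
preds = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
groups = [
    "landbird_on_land", "landbird_on_land", "landbird_on_land",
    "landbird_on_water", "landbird_on_water",
    "waterbird_on_water", "waterbird_on_water", "waterbird_on_water",
    "waterbird_on_land", "waterbird_on_land",
]

average = sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)
print("average accuracy:", average)
print("per-group accuracy:", per_group_accuracy(preds, labels, groups))
print("worst-group accuracy:", worst_group_accuracy(preds, labels, groups))
```

In this toy data the average accuracy is 0.8 while the worst-group accuracy is 0.5, which illustrates why the two metrics are reported separately. Group labels are used here only for evaluation; under ERM as described in the abstract, training does not use them.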

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Machine Learning and Data Classification

Methods: Max Pooling · Softmax · Dense Connections · Dropout · Convolution