Blessing of Class Diversity in Pre-training

Yulai Zhao; Jianshu Chen; Simon S. Du

arXiv:2209.03447·cs.LG·February 14, 2023

TL;DR

This paper gives a statistical explanation for the success of pre-training in NLP: when the pre-training classes are sufficiently diverse, transfer learning needs far fewer downstream samples than standard supervised learning.

Contribution

It introduces a new theoretical analysis that links class diversity in pre-training to improved downstream task performance, proved with a vector-form Rademacher complexity chain rule.

Findings

1. Diverse classes in pre-training lead to larger least singular values in the last layer.

2. Pre-training with diverse classes improves transfer learning risk bounds.

3. Theoretical tools include a vector-form Rademacher complexity chain rule.

Abstract

This paper presents a new statistical analysis that aims to explain the recent success of pre-training techniques in natural language processing (NLP). We prove that when the classes of the pre-training task (e.g., different words in the masked language model task) are sufficiently diverse, in the sense that the least singular value of the last linear layer in pre-training (denoted $\tilde{\nu}$) is large, pre-training can significantly improve the sample efficiency of downstream tasks. Specifically, we show that the transfer learning excess risk enjoys an $O\left(\frac{1}{\tilde{\nu}\sqrt{n}}\right)$ rate, in contrast to the $O\left(\frac{1}{\sqrt{m}}\right)$ rate of standard supervised learning. Here, $n$ is the number of pre-training samples and $m$ is the number of samples in the downstream task, and typically $n \gg m$. Our proof relies on a vector-form Rademacher…
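
The diversity measure in the abstract, the least singular value of the last linear layer, can be illustrated numerically. The following is a minimal sketch, not the authors' code: the matrix sizes, the random weights, and the helper name `least_singular_value` are all assumptions made for illustration, and the paper's exact definition of the quantity may include a normalization that is not reproduced here.

```python
import numpy as np


def least_singular_value(weight: np.ndarray) -> float:
    """Return the smallest singular value of a 2-D weight matrix."""
    # compute_uv=False returns only the singular values, sorted in descending order.
    singular_values = np.linalg.svd(weight, compute_uv=False)
    return float(singular_values[-1])


rng = np.random.default_rng(0)
num_classes, rep_dim = 1000, 16  # hypothetical sizes, not taken from the paper

# "Diverse" classes: the class vectors (rows) point in many different directions.
diverse = rng.standard_normal((num_classes, rep_dim))

# "Non-diverse" classes: every class vector is a multiple of one direction,
# so the matrix has rank 1 and its least singular value is numerically zero.
direction = rng.standard_normal((1, rep_dim))
collapsed = rng.standard_normal((num_classes, 1)) * direction

nu_diverse = least_singular_value(diverse)
nu_collapsed = least_singular_value(collapsed)

print(f"least singular value, diverse classes:   {nu_diverse:.3f}")
print(f"least singular value, collapsed classes: {nu_collapsed:.3e}")
```

In this toy setting the diverse matrix has a least singular value well away from zero, while the collapsed matrix does not, which is the sense in which a large least singular value indicates that the classes span the representation space.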

Taxonomy

Topics: Machine Learning and Data Classification · Machine Learning and Algorithms · Neural Networks and Applications

Methods: Linear Layer