Blessing of Class Diversity in Pre-training
Yulai Zhao, Jianshu Chen, Simon S. Du

TL;DR
This paper provides a statistical explanation for the success of pre-training in NLP, showing that sufficiently diverse pre-training classes improve the sample efficiency of downstream tasks.
Contribution
It introduces a new theoretical analysis linking class diversity in pre-training to improved downstream task performance, with novel proof techniques.
Findings
Class diversity in pre-training is measured by the least singular value of the last linear layer; a larger value means more diverse classes.
Pre-training with diverse classes improves transfer learning risk bounds.
Theoretical tools include a vector-form Rademacher complexity chain rule.
Abstract
This paper presents a new statistical analysis aiming to explain the recent superior achievements of the pre-training techniques in natural language processing (NLP). We prove that when the classes of the pre-training task (e.g., different words in the masked language model task) are sufficiently diverse, in the sense that the least singular value of the last linear layer in pre-training (denoted as $\tilde{\nu}$) is large, then pre-training can significantly improve the sample efficiency of downstream tasks. Specifically, we show the transfer learning excess risk enjoys an $O\left(\frac{1}{\tilde{\nu}\sqrt{n}} + \frac{1}{\sqrt{m}}\right)$ rate, in contrast to the $O\left(\frac{1}{\sqrt{m}}\right)$ rate in the standard supervised learning. Here, $n$ is the number of pre-training data and $m$ is the number of data in the downstream task, and typically $n \gg m$. Our proof relies on a vector-form Rademacher…
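The diversity measure named in the abstract, the least singular value of the last linear layer, can be illustrated with a minimal NumPy sketch. The weight matrices below are hypothetical toy values, and the layout (one row per class, one column per representation dimension) is an assumption made for illustration. The paper's exact quantity may involve additional normalization that this sketch does not attempt to reproduce.

```python
import numpy as np


def least_singular_value(weight: np.ndarray) -> float:
    """Return the smallest singular value of a 2-D weight matrix."""
    # compute_uv=False returns only the singular values, in descending order.
    singular_values = np.linalg.svd(weight, compute_uv=False)
    return float(singular_values[-1])


# Hypothetical last-layer weights (assumed layout: rows = classes,
# columns = representation dimensions).
# Here the class vectors span both representation directions.
diverse = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])

# Here every class vector lies along the same direction, so the second
# representation dimension is never used.
redundant = np.array([[1.0, 0.0],
                      [2.0, 0.0],
                      [3.0, 0.0]])

print("diverse classes:  ", least_singular_value(diverse))
print("redundant classes:", least_singular_value(redundant))
```

In this toy setting the first matrix has a least singular value well away from zero, while the second is rank-deficient and its least singular value is numerically zero, which is the sense in which classes that cover all representation directions count as diverse.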
Taxonomy
Topics: Machine Learning and Data Classification · Machine Learning and Algorithms · Neural Networks and Applications
Methods: Linear Layer
