Drop Dropout on Single-Epoch Language Model Pretraining
Houjun Liu, John Bauer, Christopher D. Manning

TL;DR
This paper empirically demonstrates that removing dropout during single-epoch language model pretraining improves downstream task performance and model editability, challenging the traditional use of dropout in deep learning regularization.
Contribution
The study provides the first thorough empirical analysis showing dropout's negative impact in single-epoch LM pretraining and advocates for its removal in such settings.
Findings
Dropout removal improves downstream performance across multiple tasks.
Models trained without dropout respond better to gradient-based model editing.
Early dropout degrades performance compared to no dropout.
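The three regimes contrasted in the findings above (no dropout, constant dropout, and early dropout) can be sketched as a per-step dropout schedule. This is a minimal illustration, not the authors' training code: the function names `dropout_rate` and `apply_dropout`, the rate 0.1, and the hard cutoff after `early_steps` are all assumptions made for the example. Early dropout is taken here to mean dropout that is active only during an initial phase of training and then switched off.

```python
import random
from typing import List, Optional


def dropout_rate(step: int, base_rate: float, early_steps: Optional[int] = None) -> float:
    """Return the dropout probability to use at a given training step.

    early_steps=None -> constant dropout at base_rate (base_rate=0.0 means no dropout).
    early_steps=k    -> dropout at base_rate for steps 0..k-1, then 0.0 (hypothetical hard cutoff).
    """
    if not 0.0 <= base_rate < 1.0:
        raise ValueError("base_rate must be in [0, 1)")
    if early_steps is None:
        return base_rate
    return base_rate if step < early_steps else 0.0


def apply_dropout(values: List[float], rate: float, rng: random.Random) -> List[float]:
    """Inverted dropout: zero each value with probability `rate`, rescale the survivors."""
    if rate == 0.0:
        return list(values)  # dropout disabled: activations pass through unchanged
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


# Illustrative values only; none of these numbers come from the paper.
rng = random.Random(0)
for step in (0, 99, 100):
    p = dropout_rate(step, base_rate=0.1, early_steps=100)
    activations = apply_dropout([1.0, 2.0, 3.0], p, rng)
    print(step, p, activations)
```

Setting `base_rate=0.0` corresponds to the no-dropout configuration that the findings favor; in that case `apply_dropout` returns the activations unchanged at every step.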
Abstract
Originally, dropout was seen as a breakthrough regularization technique that improved performance in almost all applications of deep learning by reducing overfitting. Yet the single-epoch pretraining tasks common to modern LLMs yield minimal overfitting, so dropout is typically not used for LLMs. Nevertheless, no thorough empirical investigation has been done on the role of dropout in LM pretraining. Through experiments in single-epoch pretraining of both masked (BERT) and autoregressive (Pythia 160M and 1.4B) LMs with varying levels of dropout, we find that downstream performance in language modeling, morpho-syntax (BLiMP), question answering (SQuAD), and natural-language inference (MNLI) improves when dropout is not applied during pretraining. We additionally find that the recently-introduced "early dropout" also degrades performance over applying no…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Dropout
