Distributionally Robust Language Modeling
Yonatan Oren, Shiori Sagawa, Tatsunori B. Hashimoto, Percy Liang

TL;DR
This paper introduces a distributionally robust training method for language models that improves performance on unknown test distributions by minimizing the worst-case loss over topic mixtures. It reports a 5.5 point perplexity reduction over standard MLE training.
Contribution
It proposes a novel DRO approach called topic CVaR for training language models to perform well across diverse, unseen topic distributions without prior knowledge of test data.
Findings
Achieved a 5.5 point perplexity reduction over MLE on the Yelp reviews test set.
Demonstrated robustness of the model across different topic distributions.
Improved generalization compared to standard MLE training.
Abstract
Language models are generally trained on data spanning a wide range of topics (e.g., news, reviews, fiction), but they might be applied to an a priori unknown target distribution (e.g., restaurant reviews). In this paper, we first show that training on text outside the test distribution can degrade test performance when using standard maximum likelihood (MLE) training. To remedy this without the knowledge of the test distribution, we propose an approach which trains a model that performs well over a wide range of potential test distributions. In particular, we derive a new distributionally robust optimization (DRO) procedure which minimizes the loss of the model over the worst-case mixture of topics with sufficient overlap with the training distribution. Our approach, called topic conditional value at risk (topic CVaR), obtains a 5.5 point perplexity reduction over MLE when the language…
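The worst-case mixture described in the abstract can be illustrated with a small sketch. It assumes that "sufficient overlap with the training distribution" is formalized as a CVaR-style constraint: a test mixture q over topics is allowed only if q[z] <= p[z] / alpha for every topic z, where p is the training topic proportion and alpha in (0, 1] is a hypothetical overlap parameter. The function name, the inputs, and the greedy solver are illustrative assumptions, not the authors' implementation, and the per-topic losses are taken as given rather than computed from a model.

```python
import numpy as np

def worst_case_topic_mixture(topic_losses, train_props, alpha):
    """Find the topic mixture q that maximizes sum_z q[z] * loss[z]
    subject to q[z] <= p[z] / alpha and sum_z q[z] == 1.

    Assumed formalization (not taken from the summary above):
    alpha = 1 recovers the training mixture; a smaller alpha lets the
    adversary concentrate more mass on high-loss topics.
    """
    losses = np.asarray(topic_losses, dtype=float)
    p = np.asarray(train_props, dtype=float)
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    if not np.isclose(p.sum(), 1.0):
        raise ValueError("train_props must sum to 1")

    caps = p / alpha              # largest mass each topic may receive
    q = np.zeros_like(p)
    remaining = 1.0               # probability mass still to assign

    # Greedy: fill the highest-loss topics up to their caps first.
    for z in np.argsort(-losses):
        q[z] = min(caps[z], remaining)
        remaining -= q[z]
        if remaining <= 1e-12:
            break

    return q, float(q @ losses)

# Usage with made-up numbers for three topics.
q, robust_loss = worst_case_topic_mixture(
    topic_losses=[1.0, 2.0, 4.0],
    train_props=[0.5, 0.3, 0.2],
    alpha=0.5,
)
```

In a training loop, a robust objective of this kind would replace the average loss with the returned worst-case loss, so that gradient steps focus on the topics the model currently handles worst. How the paper estimates per-topic losses and updates the mixture online is not covered by this sketch.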
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
