LMD3: Language Model Data Density Dependence
John Kirchenbauer, Garrett Honke, Gowthami Somepalli, Jonas Geiping, Daphne Ippolito, Katherine Lee, Tom Goldstein, David Andre

TL;DR
This paper introduces a methodology for analyzing language model performance on individual examples as a function of training data density, yielding statistical evidence of how a model's predictions depend on subsets of its training data.
Contribution
It presents a framework for estimating training data density at the level of individual examples and relating that density to task performance and perplexity.
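As a minimal sketch of what example-level density estimation could look like, the Python snippet below computes a Gaussian kernel density estimate over embedding vectors and shows how adding paraphrase-like points near a query raises the estimated density at that query. The use of embeddings, the Gaussian kernel, the `bandwidth` value, the synthetic data, and the function name `gaussian_kde_log_density` are all illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def gaussian_kde_log_density(train_emb, query_emb, bandwidth=1.0):
    """Log of a Gaussian kernel density estimate at each query point.

    Hypothetical helper: assumes train_emb has shape (n, d) and
    query_emb has shape (m, d).
    """
    train_emb = np.asarray(train_emb, dtype=float)
    query_emb = np.asarray(query_emb, dtype=float)
    n, d = train_emb.shape
    # Squared Euclidean distances, shape (m, n).
    diffs = query_emb[:, None, :] - train_emb[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    # Log kernel values, then a numerically stable log-mean-exp over training points.
    log_kernel = -sq_dists / (2.0 * bandwidth ** 2)
    max_log = log_kernel.max(axis=1, keepdims=True)
    log_mean = max_log[:, 0] + np.log(np.exp(log_kernel - max_log).sum(axis=1)) - np.log(n)
    # Normalizing constant of a d-dimensional isotropic Gaussian.
    log_norm = -0.5 * d * np.log(2.0 * np.pi * bandwidth ** 2)
    return log_mean + log_norm

# Synthetic stand-ins for training embeddings and one poorly supported test query.
rng = np.random.default_rng(0)
train = rng.normal(size=(200, 2))
query = np.array([[4.0, 4.0]])

# Simulated "paraphrases": extra training points placed close to the query.
paraphrases = query + 0.1 * rng.normal(size=(20, 2))

log_density_before = gaussian_kde_log_density(train, query)
log_density_after = gaussian_kde_log_density(np.vstack([train, paraphrases]), query)
print("log density before:", log_density_before[0])
print("log density after: ", log_density_after[0])
```

The log-mean-exp step keeps the estimate stable when a query sits far from all training points, which is exactly the low-support case such an analysis would be most interested in.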
Findings
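Adding paraphrases of specific test queries to the finetuning data measurably increases training data density for those queries, and the density increase is a significant predictor of the resulting performance gain.
Density measurements on pretraining data explain a significant fraction of the variance in model perplexity.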
The framework offers statistical evidence of data dependence in language models.
Abstract
We develop a methodology for analyzing language model task performance at the individual example level based on training data density estimation. Experiments with paraphrasing as a controlled intervention on finetuning data demonstrate that increasing the support in the training distribution for specific test queries results in a measurable increase in density, which is also a significant predictor of the performance increase caused by the intervention. Experiments with pretraining data demonstrate that we can explain a significant fraction of the variance in model perplexity via density measurements. We conclude that our framework can provide statistical evidence of the dependence of a target model's predictions on subsets of its training data, and can more generally be used to characterize the support (or lack thereof) in the training data for a given test task.
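To illustrate what "explaining a fraction of the variance" means in practice, the sketch below fits a simple linear regression of per-example log perplexity on log density and reports the coefficient of determination (R²). The linear model, the function name `variance_explained`, and the synthetic data are assumptions made for illustration; the abstract does not specify the statistical model the authors used.

```python
import numpy as np

def variance_explained(log_density, log_perplexity):
    """Fit y = slope * x + intercept by least squares and return (slope, intercept, R^2).

    Hypothetical helper: both inputs are 1-D arrays with one value per test example.
    """
    x = np.asarray(log_density, dtype=float)
    y = np.asarray(log_perplexity, dtype=float)
    # np.polyfit returns coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)          # variance left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variance in the target
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared

# Synthetic example: perplexity falls as density rises, plus noise.
rng = np.random.default_rng(0)
log_density = rng.normal(size=500)
log_perplexity = 3.0 - 0.5 * log_density + 0.3 * rng.normal(size=500)

slope, intercept, r2 = variance_explained(log_density, log_perplexity)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r2:.3f}")
```

An R² near 1 would mean density alone accounts for almost all of the variation in perplexity across examples, while a value near 0 would mean it accounts for almost none.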
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
