Small-to-Large Generalization: Data Influences Models Consistently Across Scale
Alaa Khaddaj, Logan Engstrom, Aleksander Madry

TL;DR
This paper investigates how the training data distribution affects language model predictions across compute scales. It finds that small- and large-scale model predictions correlate highly across choices of training data, and uses this result to assess proxy models for data attribution and dataset selection.
Contribution
It demonstrates that data influences models consistently across scale and evaluates proxy models for understanding large-scale model behavior.
Findings
High correlation between small and large model predictions across data choices
Proxy models can effectively inform data attribution for large models
Data selection strategies benefit from insights gained through scaled-down models
Abstract
Choice of training data distribution greatly influences model behavior. Yet, in large-scale settings, precisely characterizing how changes in training data affect predictions is often difficult due to model training costs. Current practice is to instead extrapolate from scaled-down, inexpensive-to-train proxy models. However, changes in data do not influence smaller and larger models identically. Therefore, understanding how choice of data affects large-scale models raises the question: how does training data distribution influence model behavior across compute scale? We find that small- and large-scale language model predictions (generally) do highly correlate across choice of training data. Equipped with these findings, we characterize how proxy scale affects effectiveness in two downstream proxy model applications: data attribution and dataset selection.
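To make the central measurement concrete, here is a minimal sketch of how one might check whether a proxy model tracks a larger model across choices of training data. It is an illustration only, not the authors' evaluation code: the helper `pearson`, the variable names, and all loss values are hypothetical, and Pearson correlation is an assumed choice of statistic. Each list position stands for one candidate training distribution, and each value is the benchmark loss of a model trained on it.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical, made-up benchmark losses: entry i is the loss of a model
# trained on candidate training distribution i.
proxy_losses = [3.10, 2.95, 3.40, 3.05, 2.80]  # small proxy model
large_losses = [2.20, 2.05, 2.55, 2.18, 1.95]  # large target model

r = pearson(proxy_losses, large_losses)
print(f"proxy/large correlation across data choices: {r:.3f}")
```

A correlation near 1 on data like this would suggest that ranking training distributions with the proxy roughly preserves the ranking under the large model, even though absolute losses differ between scales.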
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Computational and Text Analysis Methods
