AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs

Feiyang Kang; Yifan Sun; Bingbing Wen; Si Chen; Dawn Song; Rafid Mahmood; Ruoxi Jia

arXiv:2407.20177 · cs.LG · October 3, 2025

TL;DR

AutoScale introduces a scale-aware data mixing framework for pre-training large language models, optimizing data composition across different training scales to enhance efficiency and downstream performance.

Contribution

The paper proposes a novel two-stage method that predicts optimal data mixtures at larger scales from small-scale experiments, with a theoretical analysis guiding the extrapolation.

Findings

01. 28% faster perplexity reduction on GPT-2 Large
02. Up to 38% speed-up over unweighted training
03. Improved downstream task performance

Abstract

Domain reweighting is an emerging research area aimed at adjusting the relative weights of different data sources to improve the effectiveness and efficiency of LLM pre-training. We show that data mixtures that perform well at smaller scales may not retain their advantage at larger scales, challenging the existing practice of determining competitive mixtures in small-scale experiments and directly applying them at much larger scales. To address this, we propose AutoScale, a two-stage, scale-aware data composition framework. First, AutoScale fits a parametric model that predicts the model's loss under different data compositions, then uses it to find an approximate best allocation at smaller, more manageable budgets. Next, leveraging a novel theoretical analysis of how optimal compositions evolve with scale, AutoScale extrapolates that composition to larger budgets without further…
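
The two-stage idea in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper or its repository: the per-domain power-law loss b_i * n_i^(-gamma_i), the coefficients, and the helper names optimal_allocation, mixture_weights, and extrapolate are all hypothetical. Under this assumed loss form, the best split of a token budget changes with the budget, and per-domain optimal token counts found at two small budgets extend geometrically to a larger one without refitting.

```python
import math


def optimal_allocation(budget, b, gamma, iters=200):
    """Minimize sum_i b[i] * n_i ** (-gamma[i]) subject to sum_i n_i = budget.

    Assumed toy loss model (hypothetical, not the paper's fitted form).
    Stationarity gives n_i = (b_i * gamma_i / lam) ** (1 / (gamma_i + 1));
    the multiplier lam is found by bisection in log space.
    """
    def alloc(log_lam):
        return [
            math.exp((math.log(bi * gi) - log_lam) / (gi + 1.0))
            for bi, gi in zip(b, gamma)
        ]

    lo, hi = -200.0, 200.0  # bracket for log(lam)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(alloc(mid)) > budget:
            lo = mid  # allocation too large: increase lam
        else:
            hi = mid
    return alloc(0.5 * (lo + hi))


def mixture_weights(tokens):
    """Convert per-domain token counts into mixture proportions."""
    total = sum(tokens)
    return [t / total for t in tokens]


def extrapolate(n_small, n_mid):
    """Extend two optimal allocations geometrically to a larger scale.

    Under the assumed loss form, n_mid**2 / n_small is itself optimal
    for the budget equal to the sum of the returned counts.
    """
    return [m * m / s for s, m in zip(n_small, n_mid)]


if __name__ == "__main__":
    b = [1.0, 1.0]        # hypothetical loss coefficients per domain
    gamma = [0.3, 0.6]    # hypothetical decay exponents per domain

    n1 = optimal_allocation(1e3, b, gamma)
    n2 = optimal_allocation(1e4, b, gamma)
    n3 = extrapolate(n1, n2)

    print("weights at budget 1e3:", mixture_weights(n1))
    print("weights at budget 1e4:", mixture_weights(n2))
    print("extrapolated budget:", sum(n3))
    print("extrapolated weights:", mixture_weights(n3))
    print("direct optimum there:",
          mixture_weights(optimal_allocation(sum(n3), b, gamma)))
```

In this toy setting the domain with the slower-decaying loss receives a growing share as the budget increases, which is one way a mixture tuned at small scale can lose its advantage at larger scale.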

Code & Models

Repositories

feiyang-k/autoscale (Official)

Taxonomy

Topics: Data Quality and Management · Semantic Web and Ontologies · Data Mining Algorithms and Applications

Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Multi-Head Attention · Dense Connections · WordPiece · Residual Connection · Linear Warmup With Linear Decay · Dropout