LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

Prasanna Mayilvahanan; Thaddäus Wiedemer; Sayak Mallick; Matthias Bethge; Wieland Brendel

arXiv:2502.12120·cs.LG·May 21, 2026



TL;DR

This paper investigates which factors shape loss-to-loss scaling laws in LLMs and finds that pretraining data is the dominant one. Model architecture and hyperparameters have limited impact, so dataset curation is key to downstream performance.

Contribution

The study shows that pretraining data dominates loss-to-loss scaling trends, which suggests prioritizing dataset selection over architectural or hyperparameter tuning.

Findings

01 Pretraining data determines the loss-to-loss scaling trend.

02 Model size, architecture, tokenizer, and optimization hyperparameters generally have limited impact.

03 Dataset curation is crucial for downstream performance.

Abstract

Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance and generalization. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data determines the scaling trend. In contrast, model size, optimization hyperparameters, tokenizer and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, generally have limited impact. Consequently, practitioners should carefully curate suitable pretraining datasets for optimal downstream performance, while architectures and…
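To make the idea of a loss-to-loss scaling law concrete, here is a minimal sketch of fitting one. It rests on several assumptions that do not come from this page: the relation is taken to be a pure power law, loss_y ≈ K * loss_x ** kappa, with no irreducible-loss offsets; the fit is done by ordinary least squares in log-log space; the data are synthetic; and the names `fit_loss_to_loss` and `predict_loss` are hypothetical. The paper's actual functional form and fitting procedure may differ.

```python
import numpy as np


def fit_loss_to_loss(loss_x, loss_y):
    """Fit loss_y ~= K * loss_x**kappa by least squares in log-log space.

    Assumption: a pure power law without irreducible-loss offsets.
    Returns the tuple (K, kappa).
    """
    log_x = np.log(np.asarray(loss_x, dtype=float))
    log_y = np.log(np.asarray(loss_y, dtype=float))
    # polyfit returns coefficients from highest degree down: slope, intercept.
    kappa, log_k = np.polyfit(log_x, log_y, deg=1)
    return float(np.exp(log_k)), float(kappa)


def predict_loss(loss_x, K, kappa):
    """Predict the second loss from the first under the fitted power law."""
    return K * np.asarray(loss_x, dtype=float) ** kappa


# Synthetic example only: these numbers are made up for illustration.
rng = np.random.default_rng(0)
train_loss = np.linspace(2.0, 4.0, 20)   # e.g. loss on the pretraining set
true_K, true_kappa = 0.8, 1.3            # hypothetical ground-truth parameters
noise = np.exp(rng.normal(0.0, 0.01, size=train_loss.shape))
val_loss = true_K * train_loss ** true_kappa * noise  # e.g. downstream loss

K_hat, kappa_hat = fit_loss_to_loss(train_loss, val_loss)
print(f"K = {K_hat:.3f}, kappa = {kappa_hat:.3f}")
```

Under this sketch, checking whether two model families "lie on the same line" would amount to fitting each family separately and comparing the resulting (K, kappa) pairs. According to the abstract, changing the pretraining data shifts the trend, while changing architecture, tokenizer, or optimization hyperparameters generally has limited impact.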


Videos

LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws · SlidesLive

Taxonomy

Topics: FinTech, Crowdfunding, Digital Finance

Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces · LLaMA