TL;DR
This paper investigates which factors most strongly influence loss-to-loss scaling laws in LLMs. It finds that pretraining data is the dominant factor, so dataset curation is key for downstream performance, while model architecture and hyperparameters have limited impact.
Contribution
The study shows that pretraining data dominates loss-to-loss scaling trends, which suggests practitioners should prioritize dataset selection over architectural or hyperparameter tuning.
Findings
Pretraining data determines the scaling trend.
Model size and architecture have limited impact.
Dataset curation is crucial for downstream performance.
Abstract
Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance and generalization. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data determines the scaling trend. In contrast, model size, optimization hyperparameters, tokenizer and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, generally have limited impact. Consequently, practitioners should carefully curate suitable pretraining datasets for optimal downstream performance, while architectures and…
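As a rough illustration of what a loss-to-loss relationship looks like in practice, the sketch below fits a plain power law between a pretraining loss and a downstream loss by least squares in log-log space. The function name `fit_loss_to_loss`, the simple form `downstream ≈ K * train**kappa`, and all numbers are assumptions made for this example; the paper's actual functional form may differ (for instance, by including irreducible-loss offsets), and the data here is synthetic, not taken from the paper.

```python
import numpy as np


def fit_loss_to_loss(train_loss, downstream_loss):
    """Fit downstream ≈ K * train**kappa via a linear fit in log-log space.

    Hypothetical helper for illustration only; returns (K, kappa).
    """
    x = np.log(np.asarray(train_loss, dtype=float))
    y = np.log(np.asarray(downstream_loss, dtype=float))
    # log(downstream) = kappa * log(train) + log(K)
    kappa, log_k = np.polyfit(x, y, deg=1)
    return float(np.exp(log_k)), float(kappa)


# Synthetic, noise-free losses for a family of models trained on one dataset
# (values are made up for the example, not measurements).
train_loss = np.array([3.2, 3.0, 2.8, 2.6, 2.4])
true_k, true_kappa = 0.9, 1.3
downstream_loss = true_k * train_loss ** true_kappa

k, kappa = fit_loss_to_loss(train_loss, downstream_loss)
print(f"K = {k:.3f}, kappa = {kappa:.3f}")
```

Under the paper's claim, repeating such a fit for models that differ in size, architecture, or optimizer but share a pretraining dataset would be expected to give similar curves, whereas changing the pretraining dataset would shift the fitted trend.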
Taxonomy
Topics: FinTech, Crowdfunding, Digital Finance
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces · LLaMA
