Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder
Götz-Henrik Wiegand, Lorena Raichle, Rico Städeli, Tomas Hrycej, Bernhard Bermeitinger, Siegfried Handschuh

TL;DR
This study examines how dataset size affects performance in a small, attention-only Transformer decoder. It finds diminishing returns and offers practical guidance for trading off dataset size against compute.
Contribution
It isolates dataset-size effects in a controlled, small-scale setting using a strongly reduced attention-only decoder, and reports behavior consistent with scaling laws that is relevant to resource-constrained environments.
Findings
About 30% of the training data is enough to reach approximately 90% of the full-data validation token-level accuracy
Performance improves smoothly as the training subset grows, with clear diminishing returns
The results give practical guidance for balancing dataset size against computational cost
Abstract
Training Transformer language models is expensive, as performance typically improves with increasing dataset size and computational budget. Although scaling laws describe this trend at large scale, their implications in controlled, smaller-scale settings remain less explored. In this work, we isolate dataset-size effects using a strongly reduced attention-only decoder architecture. By training on progressively larger power-of-two subsets, we observe smooth performance improvements accompanied by clear diminishing returns, consistent with scaling-law behavior. Using only about 30% of the training data is sufficient to reach approximately 90% of the full-data validation token-level accuracy. These results provide actionable insights into dataset scaling in a controlled, component-isolated setting and offer practical guidance for balancing dataset size and computational cost in compute-…
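The subset protocol and the "fraction of data needed for 90% of full accuracy" readout described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `smallest` starting size, and the accuracy values in the usage lines are hypothetical placeholders and are not results from the paper.

```python
from typing import Dict, List


def power_of_two_subset_sizes(n_total: int, smallest: int = 1) -> List[int]:
    """Return sizes smallest, 2*smallest, 4*smallest, ... below n_total,
    followed by n_total itself so the full dataset is always included."""
    if n_total < 1 or smallest < 1:
        raise ValueError("n_total and smallest must be positive")
    sizes = []
    size = smallest
    while size < n_total:
        sizes.append(size)
        size *= 2
    sizes.append(n_total)
    return sizes


def smallest_fraction_reaching(acc_by_size: Dict[int, float],
                               target_ratio: float = 0.9) -> float:
    """Return the smallest subset size, as a fraction of the largest size,
    whose accuracy is at least target_ratio times the full-data accuracy."""
    n_full = max(acc_by_size)  # the largest key is treated as the full dataset
    threshold = target_ratio * acc_by_size[n_full]
    for size in sorted(acc_by_size):
        if acc_by_size[size] >= threshold:
            return size / n_full
    return 1.0


# Usage with invented placeholder values (NOT measurements from the paper).
sizes = power_of_two_subset_sizes(1000, smallest=100)  # [100, 200, 400, 800, 1000]
placeholder_acc = dict(zip(sizes, [0.20, 0.30, 0.46, 0.49, 0.50]))
fraction = smallest_fraction_reaching(placeholder_acc)
```

In practice, each size in `sizes` would correspond to one training run on a subset of that many examples, and `acc_by_size` would hold the measured validation token-level accuracy of each run.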
