Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder

Götz-Henrik Wiegand; Lorena Raichle; Rico Städeli; Tomas Hrycej; Bernhard Bermeitinger; Siegfried Handschuh

arXiv:2604.09389·cs.LG·April 13, 2026

TL;DR

This study investigates how dataset size affects performance in a small, attention-only Transformer decoder. It finds diminishing returns as the training set grows and offers practical guidance on trading dataset size against compute.

Contribution

The paper demonstrates scaling-law behavior with respect to dataset size in a controlled, small-scale setting, using a strongly reduced attention-only decoder architecture. This provides insights for resource-constrained environments.

Findings

01 About 30% of the training data is enough to reach approximately 90% of the full-data validation token-level accuracy.

02 Performance improves smoothly as the dataset grows, with clear diminishing returns.

03 The results offer practical guidance for balancing dataset size against computational cost.

Abstract

Training Transformer language models is expensive, as performance typically improves with increasing dataset size and computational budget. Although scaling laws describe this trend at large scale, their implications in controlled, smaller-scale settings remain less explored. In this work, we isolate dataset-size effects using a strongly reduced attention-only decoder architecture. By training on progressively larger power-of-two subsets, we observe smooth performance improvements accompanied by clear diminishing returns, consistent with scaling-law behavior. Using only about 30% of the training data is sufficient to reach approximately 90% of the full-data validation token-level accuracy. These results provide actionable insights into dataset scaling in a controlled, component-isolated setting and offer practical guidance for balancing dataset size and computational cost in compute-…
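The subset protocol described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names, the fixed-shuffle nesting of subsets, and the choice to always include the full dataset as the last size are assumptions made here. The accuracy values passed to `smallest_fraction_reaching` would come from your own training runs.

```python
import random


def power_of_two_subset_sizes(n_total):
    """Return subset sizes 1, 2, 4, ... below n_total, followed by n_total itself."""
    sizes = []
    size = 1
    while size < n_total:
        sizes.append(size)
        size *= 2
    sizes.append(n_total)  # assumption: the full dataset is always the final size
    return sizes


def nested_subsets(examples, seed=0):
    """Yield (size, subset) pairs taken as prefixes of one fixed shuffle.

    Nesting (each smaller subset is contained in every larger one) is an
    assumption of this sketch, not something stated in the abstract.
    """
    order = list(examples)
    random.Random(seed).shuffle(order)
    for size in power_of_two_subset_sizes(len(order)):
        yield size, order[:size]


def smallest_fraction_reaching(results, target_ratio=0.9):
    """Find the smallest data fraction that reaches target_ratio of full-data accuracy.

    results: list of (subset_size, accuracy) pairs, including the full dataset.
    """
    results = sorted(results)
    full_size, full_acc = results[-1]
    for size, acc in results:
        if acc >= target_ratio * full_acc:
            return size / full_size
    return 1.0
```

In use, one would train the model once per subset yielded by `nested_subsets`, record the validation token-level accuracy for each size, and pass the resulting pairs to `smallest_fraction_reaching` to see where the diminishing-returns threshold falls for a given target.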
