Revisiting Training Scale: An Empirical Study of Token Count, Power Consumption, and Parameter Efficiency
Joe Dwyer

TL;DR
This study empirically investigates how increasing training token counts impacts power consumption, efficiency, and performance in large language models, revealing that larger token counts may reduce training efficiency despite marginal performance gains.
Contribution
It applies an energy-aware parameter efficiency metric, introduced in prior work, to token-scale analysis and demonstrates that higher token counts can lead to decreased training efficiency, emphasizing the importance of energy considerations in model training.
Findings
Power consumption and execution duration increase with token count.
Training efficiency declines monotonically as token count increases.
Marginal performance improvements do not justify higher energy costs.
Abstract
Research in machine learning has questioned whether increases in training token counts reliably produce proportional performance gains in large language models. Building on prior work introducing an energy-aware parameter efficiency metric, this study empirically examines the effects of increasing training token counts under fixed hardware and training conditions. The significance of this work lies in the explicit integration of power consumption and execution duration, as reflected by the power sampling frequency, into token-scale analysis. This addresses a gap in prior studies emphasizing performance outcomes while underrepresenting computational and energy costs. Using a repeated-measures experimental design on a constant GPU instance with an identical model architecture, optimizer settings, and epoch counts, a 1.1-billion-parameter TinyLlama model was trained at three token counts…
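As an illustration of how sampled power readings and execution duration combine into an energy figure, the following is a minimal sketch and not the study's measurement code. It assumes evenly spaced power samples in watts at a known sampling frequency; the function names and the tokens-per-joule ratio are hypothetical and are not the paper's energy-aware parameter efficiency metric, whose definition is not given here.

```python
from typing import Sequence


def duration_seconds(n_samples: int, sample_hz: float) -> float:
    """Elapsed time covered by n evenly spaced samples (assumed spacing)."""
    if sample_hz <= 0:
        raise ValueError("sample_hz must be positive")
    return max(n_samples - 1, 0) / sample_hz


def energy_joules(power_watts: Sequence[float], sample_hz: float) -> float:
    """Integrate power over time with the trapezoidal rule.

    Energy (J) is the time integral of power (W); with evenly spaced
    samples each interval contributes the mean of its two endpoints
    multiplied by the sample spacing.
    """
    if sample_hz <= 0:
        raise ValueError("sample_hz must be positive")
    dt = 1.0 / sample_hz
    total = 0.0
    for left, right in zip(power_watts, power_watts[1:]):
        total += 0.5 * (left + right) * dt
    return total


def tokens_per_joule(tokens: int, energy_j: float) -> float:
    """Hypothetical illustrative ratio; NOT the metric used in the paper."""
    if energy_j <= 0:
        raise ValueError("energy_j must be positive")
    return tokens / energy_j


if __name__ == "__main__":
    # Made-up readings for demonstration only: a constant 100 W draw
    # sampled at 1 Hz.
    samples = [100.0, 100.0, 100.0]
    energy = energy_joules(samples, sample_hz=1.0)
    print(duration_seconds(len(samples), 1.0), energy, tokens_per_joule(1000, energy))
```

Because energy is the product of average power and duration, a run that draws the same power for longer consumes proportionally more energy, which is why both quantities matter when comparing token counts.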
Taxonomy
Topics: Natural Language Processing Techniques · Big Data and Digital Economy · Machine Learning and Data Classification
