The Fair Language Model Paradox
Andrea Pinto, Tomer Galanti, Randall Balestriero

TL;DR
This paper investigates hidden biases in large language model training. It shows that weight decay disproportionately degrades performance on low-frequency tokens, which are crucial for linguistic fairness, and argues that new regularization methods are needed.
Contribution
It uncovers the subtle, token-level biases introduced by weight decay in LLM training, emphasizing the importance of fairness across token frequencies.
Findings
Weight decay biases training against low-frequency tokens across models and datasets.
Low-frequency tokens are underrepresented due to training dynamics.
Biases are detectable only at the token level, not in aggregate metrics.
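To illustrate why a bias can be visible per token yet hidden in an aggregate metric, here is a minimal sketch in plain Python. It is not the authors' evaluation code: the function names and the toy token ids and loss values are hypothetical, chosen only to contrast a position-averaged loss with an average taken over distinct tokens.

```python
from collections import Counter, defaultdict


def per_token_mean_loss(token_ids, losses):
    """Average the per-position loss separately for each distinct token id."""
    total = defaultdict(float)
    count = Counter()
    for tok, loss in zip(token_ids, losses):
        total[tok] += loss
        count[tok] += 1
    return {tok: total[tok] / count[tok] for tok in count}


def aggregate_loss(losses):
    """Position-averaged loss: frequent tokens dominate this number."""
    return sum(losses) / len(losses)


def token_balanced_loss(token_ids, losses):
    """Unweighted mean of the per-token means: every distinct token counts once."""
    means = per_token_mean_loss(token_ids, losses)
    return sum(means.values()) / len(means)


# Hypothetical toy batch: token 0 is frequent and well predicted,
# tokens 1 and 2 are rare and poorly predicted.
token_ids = [0] * 8 + [1, 2]
losses = [0.1] * 8 + [2.0, 3.0]

print("aggregate loss:     ", round(aggregate_loss(losses), 3))
print("token-balanced loss:", round(token_balanced_loss(token_ids, losses), 3))
print("per-token means:    ", per_token_mean_loss(token_ids, losses))
```

In this toy batch the aggregate loss is about 0.58 while the token-balanced loss is about 1.7, because the two rare tokens carry high loss but contribute only two of the ten positions.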
Abstract
Large Language Models (LLMs) are widely deployed in real-world applications, yet little is known about their training dynamics at the token level. Evaluation typically relies on aggregated training loss, measured at the batch level, which overlooks subtle per-token biases arising from (i) varying token-level dynamics and (ii) structural biases introduced by hyperparameters. While weight decay is commonly used to stabilize training, we reveal that it silently introduces performance biases detectable only at the token level. In fact, we empirically show across different dataset sizes, model architectures and sizes ranging from 270M to 3B parameters that as weight decay increases, low-frequency tokens are disproportionately depreciated. This is particularly concerning, as these neglected low-frequency tokens represent the vast majority of the token distribution in most languages, calling…
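The abstract's point that low-frequency tokens make up most of the vocabulary can be checked on any tokenized corpus. The sketch below is a hypothetical illustration, not the paper's procedure: the function `low_frequency_share`, the threshold, and the toy corpus are all assumptions. It reports two different quantities, the share of distinct tokens that are rare and the share of corpus positions those tokens occupy.

```python
from collections import Counter


def low_frequency_share(token_ids, threshold):
    """Return (share of distinct tokens, share of positions) for tokens
    that occur fewer than `threshold` times in `token_ids`."""
    counts = Counter(token_ids)
    low = [tok for tok, c in counts.items() if c < threshold]
    vocab_share = len(low) / len(counts)
    mass_share = sum(counts[tok] for tok in low) / len(token_ids)
    return vocab_share, mass_share


# Hypothetical long-tailed corpus: two frequent tokens, eight tokens seen once.
corpus = [0] * 60 + [1] * 32 + list(range(2, 10))

vocab_share, mass_share = low_frequency_share(corpus, threshold=5)
print(f"rare tokens: {vocab_share:.0%} of the vocabulary, "
      f"{mass_share:.0%} of the positions")
```

Here the rare tokens are 80% of the distinct tokens but only 8% of the positions, which is the kind of imbalance under which a position-averaged loss can look healthy while most of the vocabulary is poorly modeled.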
Peer Reviews
Decision: Submitted to ICLR 2025
The paper brings forward a nuanced perspective on weight decay, highlighting an often-overlooked effect on low-frequency tokens in LLMs. This is particularly timely given the widespread use of weight decay without token-level monitoring. The study uses multiple models with varying architectures and sizes across different datasets, demonstrating the robustness of the findings.
The use of only the IMDB dataset (including an extended version) raises concerns about the generalizability of the results across other types of text data. Testing on a more varied set of corpora (e.g., diverse languages or topics) would strengthen the claims about low-frequency token bias. The paper’s theoretical discussion on the link between token frequency, regularization, and loss functions feels dense and somewhat disjointed from the empirical findings. A clearer integration of these theo
1. Innovative thinking on model weight decay for unbalanced class distribution data: the article proposes that the increased weight decay of large language models leads to model underperformance on low-frequency tokens and significantly better performance on high-frequency tokens, which can lead to model bias and unfairness. It triggers further thinking in the field of NLP on the contradiction between model generalization performance and model bias under long-tailed data, and focuses the attenti
1. Dataset limitation: although the paper uses the IMDB dataset for experiments, the dataset is limited in types and domains, and may not fully represent the model's performance across diverse tasks and domains. 2. Lack of regularization comparison experiments: the paper does not compare the effects of different regularization techniques (e.g., dropout, data augmentation), which can make the
1. The paper presents a novel perspective for analyzing the performance of large language models. The authors observe a difference in how high-frequency and low-frequency tokens are learned and identify its cause, namely the weight decay regularization technique. The experimental results demonstrate a significant correlation between weight decay and the loss of low-frequency tokens. 2. In addition to empirical conclusions, the authors also provide a theoretical discussion
1. The experiments in this paper use the IMDB corpus for model training. However, this corpus is biased and differs significantly from mainstream pre-training corpora. Consequently, it may not adequately reflect potential issues in mainstream large language model training. 2. The experiments in this paper are based on training sequences of lengths 128 and 64, which are somewhat too short for large language model (LLM) training. For instance, in Figure 2, the tokenized tokens using the llama3 t
Taxonomy
Topics: Legal Language and Interpretation · European and International Law Studies · Discrimination and Equality Law
Methods: Weight Decay
