MiLe Loss: a New Entropy-Weighed Loss for Mitigating the Bias of Learning Difficulties in Large Language Models
Zhenpeng Su, Xing Wu, Xue Bai, Zijia Lin, Hui Chen, Guiguang Ding, Wei Zhou, Songlin Hu

TL;DR
This paper introduces MiLe Loss, a novel entropy-based loss function that dynamically emphasizes difficult-to-learn tokens during training, improving large language models' performance on downstream tasks.
Contribution
The paper proposes a new entropy-weighted loss function that adaptively focuses on infrequent and challenging tokens, addressing bias in training large language models.
Findings
Models with MiLe Loss outperform baselines on downstream benchmarks.
MiLe Loss improves learning of infrequent tokens.
Performance gains are consistent across different model scales.
Abstract
Generative language models are usually pretrained on large text corpus via predicting the next token (i.e., sub-word/word/phrase) given the previous ones. Recent works have demonstrated the impressive performance of large generative language models on downstream tasks. However, existing generative language models generally neglect an inherent challenge in text corpus during training, i.e., the imbalance between frequent tokens and infrequent ones. It can lead a language model to be dominated by common and easy-to-learn tokens, thereby overlooking the infrequent and difficult-to-learn ones. To alleviate that, we propose a MiLe Loss function for mitigating the bias of learning difficulties with tokens. During training, it can dynamically assess the learning difficulty of a to-be-learned token, according to the information entropy of the corresponding predicted probability distribution…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
MethodsFocus
