Rho-1: Not All Tokens Are What You Need
Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, Weizhu Chen

TL;DR
Rho-1 introduces Selective Language Modeling (SLM), which scores pretraining tokens with a reference model and trains only on the higher-scoring ones. Compared with training on every token, this improves efficiency and accuracy on math and diverse general tasks.
Contribution
The paper proposes a selective training approach that applies the loss only to useful tokens rather than to every token in the corpus, improving both language model accuracy and token efficiency.
Findings
Up to 30% absolute improvement in few-shot accuracy across 9 math tasks
State-of-the-art results on the MATH dataset with fewer pretraining tokens
6.8% average performance boost across diverse tasks
Abstract
Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens. Challenging this norm, we posit that "Not all tokens in a corpus are equally important for language model training". Our initial analysis examines the token-level training dynamics of language models, revealing distinct loss patterns for different tokens. Leveraging these insights, we introduce a new language model called Rho-1. Unlike traditional LMs that learn to predict every next token in a corpus, Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens that align with the desired distribution. This approach involves scoring pretraining tokens using a reference model, and then training the language model with a focused loss on tokens with higher scores. When continually pretrained on the 15B-token OpenWebMath corpus, Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% on 9 math tasks.…
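The two steps the abstract describes (score each token with a reference model, then apply the loss only to the higher-scoring tokens) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the score is the gap between the training model's per-token loss and the reference model's per-token loss, and that a fixed fraction of tokens is kept. The function names and the keep_ratio parameter are hypothetical, and the per-token losses are toy numbers rather than outputs of real models.

```python
import math


def token_scores(train_losses, ref_losses):
    # Assumed scoring rule: how much worse the training model does than
    # the reference model on each token (higher = more worth training on).
    return [t - r for t, r in zip(train_losses, ref_losses)]


def select_top_tokens(scores, keep_ratio):
    # Keep the highest-scoring fraction of tokens; return their positions
    # in original sequence order.
    k = max(1, math.ceil(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def selective_loss(train_losses, ref_losses, keep_ratio=0.5):
    # Average the training loss over the selected tokens only, so that
    # unselected tokens contribute nothing to the objective.
    kept = select_top_tokens(token_scores(train_losses, ref_losses), keep_ratio)
    loss = sum(train_losses[i] for i in kept) / len(kept)
    return loss, kept


# Toy per-token losses for a six-token sequence (illustrative values only).
train = [2.0, 0.5, 3.0, 1.0, 4.0, 0.25]
ref = [1.0, 0.5, 3.5, 0.25, 1.0, 0.25]
loss, kept = selective_loss(train, ref, keep_ratio=0.5)
print(kept, loss)
```

In this toy sequence, the token at position 2 has a high training loss but an even higher reference loss, so it is dropped: under the assumed scoring rule, a token that the reference model also finds hard is treated as less useful than one where the training model lags behind the reference.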
Code & Models
- 🤗 microsoft/rho-math-1b-v0.1 · model · 82 dl · ♡ 15
- 🤗 microsoft/rho-math-1b-interpreter-v0.1 · model · 42 dl · ♡ 4
- 🤗 microsoft/rho-math-7b-v0.1 · model · 47 dl · ♡ 20
- 🤗 microsoft/rho-math-7b-interpreter-v0.1 · model · 35 dl · ♡ 35
- 🤗 arzeth/rho-math-7b-interpreter-v0.1.imatrix-GGUF · model · 11 dl · ♡ 2
- 🤗 hflog/microsoft-rho-math-1b-v0.1 · model · 1 dl
- 🤗 QuantFactory/rho-math-1b-interpreter-v0.1-GGUF · model · 135 dl
- 🤗 QuantFactory/rho-math-1b-v0.1-GGUF · model · 111 dl
- 🤗 RichardErkhov/microsoft_-_rho-math-1b-v0.1-4bits · model · 2 dl
- 🤗 RichardErkhov/microsoft_-_rho-math-1b-v0.1-8bits · model · 2 dl
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
