Nanbeige4-3B Technical Report: Exploring the Frontier of Small Language Models
Chen Yang, Guangyue Peng, Jiaying Zhu, Ran Le, Ruixiang Feng, Tao Zhang, Wei Ruan, Xiaoqi Liu, Xiaoxue Cheng, Xiyun Xu, Yang Song, Yanzipeng Gao, Yiming Jia, Yun Xing, Yuntao Wen, Zekai Wang, Zhenwei An, Zhicong Sun, Zongchao Chen

TL;DR
Nanbeige4-3B is a small yet high-performing language model family that leverages innovative training and fine-tuning techniques to outperform comparable models and rival larger ones across various benchmarks.
Contribution
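The paper introduces a Fine-Grained Warmup-Stable-Decay (FG-WSD) training scheduler, an SFT data refinement mechanism that combines deliberative generation refinement with chain-of-thought reconstruction, and a Dual Preference Distillation (DPD) method, all aimed at extending the capabilities of small language models beyond previous limits.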
Findings
- Nanbeige4-3B outperforms similar-sized models on multiple benchmarks.
- The FG-WSD scheduler, the SFT data refinement mechanism, and DPD distillation are each reported to improve model performance.
- The model rivals larger models in reasoning and human alignment tasks.
Abstract
We present Nanbeige4-3B, a family of small-scale but high-performing language models. Pretrained on 23T high-quality tokens and finetuned on over 30 million diverse instructions, we extend the boundary of the scaling law for small language models. In pre-training, we design a Fine-Grained Warmup-Stable-Decay (FG-WSD) training scheduler, which progressively refines data mixtures across stages to boost model performance. In post-training, to improve the quality of the SFT data, we design a joint mechanism that integrates deliberative generation refinement and chain-of-thought reconstruction, yielding substantial gains on complex tasks. Following SFT, we employ our flagship reasoning model to distill Nanbeige4-3B through our proposed Dual Preference Distillation (DPD) method, which leads to further performance gains. Finally, a multi-stage reinforcement learning phase was applied,…
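To illustrate the idea behind a warmup-stable-decay scheduler whose data mixture changes by stage, here is a minimal sketch. It is a generic WSD learning-rate curve paired with a stage lookup, not the paper's FG-WSD implementation. The abstract does not give stage boundaries, learning rates, or mixture weights, so every name and number below (`wsd_lr`, `Stage`, `stage_at`, the step counts, the source weights) is a hypothetical placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One training stage with its own data mixture (all values hypothetical)."""
    name: str
    end_step: int   # first step that no longer belongs to this stage
    mixture: dict   # data-source name -> sampling weight, summing to 1.0


def wsd_lr(step, total_steps, warmup_steps, decay_steps,
           peak_lr=3e-4, min_lr=3e-5):
    """Generic warmup-stable-decay learning rate (linear warmup, linear decay)."""
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable: hold the peak learning rate.
        return peak_lr
    # Decay: interpolate linearly from peak_lr down to min_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


def stage_at(step, stages):
    """Return the stage (and therefore the data mixture) active at `step`."""
    for stage in stages:
        if step < stage.end_step:
            return stage
    return stages[-1]


# Hypothetical schedule: the stable phase is split into two sub-stages whose
# mixtures shift toward (assumed) higher-quality sources before the decay stage.
STAGES = [
    Stage("stable-early", 500, {"web": 0.7, "code": 0.2, "math": 0.1}),
    Stage("stable-late", 900, {"web": 0.5, "code": 0.3, "math": 0.2}),
    Stage("decay", 1000, {"web": 0.3, "code": 0.3, "math": 0.4}),
]

if __name__ == "__main__":
    for step in (0, 250, 700, 950):
        lr = wsd_lr(step, total_steps=1000, warmup_steps=10, decay_steps=100)
        stage = stage_at(step, STAGES)
        print(step, f"{lr:.2e}", stage.name, stage.mixture)
```

Splitting the stable phase into sub-stages is one plausible reading of "fine-grained"; the actual staging and mixture refinement are described in the full paper, not in this summary.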
Code & Models
- 🤗 Nanbeige/Nanbeige4-3B-Thinking-2511 · model · 2.6k dl · ♡ 205
- 🤗 Nanbeige/Nanbeige4-3B-Base · model · 3.3k dl · ♡ 60
- 🤗 arnomatic/Nanbeige4-3B-Thinking-2511-heretic · model · 7 dl · ♡ 4
- 🤗 C10X/Nanbeige4-3B-Thinking-2511-Claude-4.5-Opus-High-Reasoning-Distill · model · 532 dl
- 🤗 C10X/Nanbeige4-3B-Thinking-2511-Claude-4.5-Opus-High-Reasoning-Distill-heretic · model · 12 dl
- 🤗 C10X/Nanbeige4-3B-Thinking-2511-Claude-4.5-Opus-High-Reasoning-Distill-V2 · model · 33 dl · ♡ 1
- 🤗 C10X/Nanbeige4-3B-Thinking-2511-Claude-4.5-Opus-High-Reasoning-Distill-V2-heretic · model · 62 dl · ♡ 4
- 🤗 Mungert/Nanbeige4-3B-Thinking-2511-GGUF · model · 213 dl · ♡ 1
- 🤗 AlekseyCalvin/Lyrical_ru2_en_NanBeige_3B · model · 3 dl
- 🤗 Aptronym/Nanbeige4-3B-Base-heretic-1 · model · 1 dl
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
