Scaling Laws for Code: Every Programming Language Matters
Jian Yang, Shawn Guo, Lin Jing, Wei Zhang, Aishan Liu, Chuan Hao, Zhoujun Li, Wayne Xin Zhao, Xianglong Liu, Weifeng Lv, Bryan Dai

TL;DR
This paper systematically explores the scaling laws of multilingual code pre-training for large language models, revealing how different programming languages impact performance and how strategic token allocation improves multilingual capabilities.
Contribution
It presents the first comprehensive study of multilingual scaling laws for code LLMs, including over 1000 experiments and a novel token allocation strategy based on language utility and synergy.
Findings
Interpreted languages such as Python benefit more from increased model size and data than compiled languages do.
Multilingual pre-training yields synergistic benefits among similar programming languages.
A proportion-dependent token allocation strategy outperforms uniform distribution, enhancing overall performance.
Abstract
Code large language models (Code LLMs) are powerful but costly to train, with scaling laws predicting performance from model size, data, and compute. However, different programming languages (PLs) have varying impacts during pre-training that significantly affect base model performance, leading to inaccurate performance prediction. Moreover, existing works focus on language-agnostic settings, neglecting the inherently multilingual nature of modern software development. It is therefore necessary first to investigate the scaling laws of individual PLs, and then to consider their mutual influences to arrive at the final multilingual scaling law. In this paper, we present the first systematic exploration of scaling laws for multilingual code pre-training, conducting over 1000 experiments (equivalent to more than 336,000 H800 hours) across multiple PLs, model sizes (0.2B to 14B parameters), and dataset…
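As a rough illustration of how a per-language scaling law could guide token allocation, the sketch below uses the widely known Chinchilla-style form L(N, D) = E + A / N^alpha + B / D^beta and compares a uniform token split against a grid-searched split for two languages. This is not the paper's method: the functional form is a stand-in, the coefficients in COEFFS are made-up placeholders, the helper names are hypothetical, and the cross-language synergy described above is ignored entirely.

```python
from typing import Dict, Tuple

# Hypothetical per-language coefficients (E, A, alpha, B, beta).
# Illustrative placeholders only; NOT values fitted in the paper.
COEFFS: Dict[str, Tuple[float, float, float, float, float]] = {
    "python": (1.0, 400.0, 0.34, 410.0, 0.28),
    "rust": (0.8, 300.0, 0.30, 200.0, 0.30),
}


def predicted_loss(n_params: float, n_tokens: float,
                   e: float, a: float, alpha: float,
                   b: float, beta: float) -> float:
    """Chinchilla-style loss: E + A / N**alpha + B / D**beta."""
    return e + a / n_params ** alpha + b / n_tokens ** beta


def mean_loss(n_params: float, budget: float,
              shares: Dict[str, float]) -> float:
    """Average predicted loss when each language gets budget * share tokens."""
    total = sum(
        predicted_loss(n_params, budget * shares[lang], *COEFFS[lang])
        for lang in shares
    )
    return total / len(shares)


def best_split(n_params: float, budget: float,
               steps: int = 100) -> Tuple[Dict[str, float], float]:
    """Grid-search the Python/Rust token split that minimizes mean loss."""
    best_shares: Dict[str, float] = {}
    best_value = float("inf")
    for i in range(1, steps):
        p = i / steps
        shares = {"python": p, "rust": 1.0 - p}
        value = mean_loss(n_params, budget, shares)
        if value < best_value:
            best_shares, best_value = shares, value
    return best_shares, best_value


if __name__ == "__main__":
    n_params, budget = 1e9, 1e11  # assumed model size and token budget
    uniform = mean_loss(n_params, budget, {"python": 0.5, "rust": 0.5})
    shares, tuned = best_split(n_params, budget)
    print(f"uniform split loss: {uniform:.4f}")
    print(f"tuned split {shares} loss: {tuned:.4f}")
```

Because the grid includes the 50/50 split, the tuned result can never be worse than uniform under this toy model. A faithful version would need the paper's fitted per-language parameters and its treatment of language synergy, neither of which appears in this summary.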
Code & Models
- 🤗 IQuestLab/IQuest-Coder-V1-7B-Instruct (model) · 2.2k dl · ♡ 17
- 🤗 IQuestLab/IQuest-Coder-V1-7B-Thinking (model) · 423 dl · ♡ 9
- 🤗 IQuestLab/IQuest-Coder-V1-40B-Instruct (model) · 12k dl · ♡ 289
- 🤗 IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct (model) · 12k dl · ♡ 324
- 🤗 IQuestLab/IQuest-Coder-V1-40B-Thinking (model) · 330 dl · ♡ 16
- 🤗 IQuestLab/IQuest-Coder-V1-40B-Loop-Thinking (model) · 162 dl · ♡ 12
- 🤗 IQuestLab/IQuest-Coder-V1-7B-Base (model) · 113 dl · ♡ 10
- 🤗 IQuestLab/IQuest-Coder-V1-40B-Base-Stage1 (model) · 23 dl · ♡ 28
- 🤗 IQuestLab/IQuest-Coder-V1-40B-Base (model) · 114 dl · ♡ 46
- 🤗 cyankiwi/IQuest-Coder-V1-40B-Instruct-AWQ-4bit (model) · 26 dl · ♡ 3
Taxonomy
Topics: Software Engineering Research · Machine Learning in Materials Science · Topic Modeling
