Scaling Laws for Code: Every Programming Language Matters

Jian Yang; Shawn Guo; Lin Jing; Wei Zhang; Aishan Liu; Chuan Hao; Zhoujun Li; Wayne Xin Zhao; Xianglong Liu; Weifeng Lv; Bryan Dai

arXiv:2512.13472·cs.CL·December 16, 2025

TL;DR

This paper systematically explores the scaling laws of multilingual code pre-training for large language models, revealing how different programming languages impact performance and how strategic token allocation improves multilingual capabilities.

Contribution

It presents the first comprehensive study of multilingual scaling laws for code LLMs, including over 1000 experiments and a novel token allocation strategy based on language utility and synergy.

Findings

01 Interpreted languages such as Python benefit more from increased model size and data than compiled languages.

02 Multilingual pre-training yields synergistic benefits among similar programming languages.

03 A proportion-dependent token allocation strategy outperforms a uniform distribution of tokens across languages, improving overall performance.

Abstract

Code large language models (Code LLMs) are powerful but costly to train, with scaling laws predicting performance from model size, data, and compute. However, different programming languages (PLs) have varying impacts during pre-training that significantly affect base model performance, leading to inaccurate performance prediction. In addition, existing works focus on language-agnostic settings, neglecting the inherently multilingual nature of modern software development. Therefore, it is first necessary to investigate the scaling laws of individual PLs, and then to consider their mutual influences to arrive at the final multilingual scaling law. In this paper, we present the first systematic exploration of scaling laws for multilingual code pre-training, conducting more than 1000 experiments (equivalent to 336,000+ H800 hours) across multiple PLs, model sizes (0.2B to 14B parameters), and dataset…
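To make the idea of "predicting performance from model size and data" concrete, the sketch below evaluates a generic parametric scaling law of the widely used form L(N, D) = E + A / N^alpha + B / D^beta. This is an illustration only: the functional form is an assumption and is not confirmed by the truncated abstract as the one the paper fits, and the class name, coefficient values, and token budget are hypothetical placeholders rather than results from the paper. Only the 0.2B and 14B model sizes come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingLaw:
    """Generic parametric loss model: L(N, D) = E + A / N**alpha + B / D**beta.

    All coefficients are hypothetical placeholders, not values from the paper.
    """
    E: float      # irreducible loss (floor as N and D grow without bound)
    A: float      # scale of the model-size term
    alpha: float  # exponent of the model-size term
    B: float      # scale of the data term
    beta: float   # exponent of the data term

    def loss(self, n_params: float, n_tokens: float) -> float:
        """Predicted loss for a model with n_params parameters trained on n_tokens tokens."""
        return self.E + self.A / n_params**self.alpha + self.B / n_tokens**self.beta


# Placeholder coefficients chosen only to produce a plausible-looking curve.
law = ScalingLaw(E=1.0, A=400.0, alpha=0.3, B=400.0, beta=0.3)

# Model sizes spanning the range named in the abstract (0.2B to 14B parameters).
# The token budget is an assumed value for illustration.
TOKENS = 1e12
for n_params in (0.2e9, 1e9, 14e9):
    print(f"N={n_params:.1e}  D={TOKENS:.1e}  predicted loss={law.loss(n_params, TOKENS):.4f}")
```

Under this assumed form, a per-language study would fit a separate set of coefficients for each programming language, so that a language with larger exponents would show a faster drop in predicted loss as model size and data grow.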

Taxonomy

Topics: Software Engineering Research · Machine Learning in Materials Science · Topic Modeling