A Taxonomy of Programming Languages for Code Generation
Nishat Raihan, Christian Newman, Marcos Zampieri

TL;DR
This paper introduces a systematic taxonomy that classifies 646 programming languages into four resource tiers. It reveals an extreme imbalance in how languages are represented in code corpora, and the tiers can inform dataset curation and the evaluation of multilingual LLMs.
Contribution
The paper establishes the first reproducible classification of programming languages by resource availability, giving a framework for dataset curation and model evaluation.
Findings
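Only 1.9% of languages (Tier 3, High) account for 74.6% of all tokens in seven major corpora.
71.7% of languages (Tier 0, Scarce) contribute just 1.0% of tokens.
Statistical analyses of within-tier inequality, dispersion, and distributional skew confirm that the imbalance is both extreme and systematic.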
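
The findings above pair a share of languages with a share of tokens for each tier. The sketch below shows one way such figures could be computed from per-language token counts. It is illustrative only: the language names, token counts, and tier assignments are made-up placeholders, the function names are hypothetical, and the Gini coefficient is used here as one common inequality measure, not necessarily the statistic the authors report.

```python
from collections import defaultdict


def tier_shares(token_counts, tier_of):
    """Return {tier: (share_of_languages, share_of_tokens)} as fractions."""
    total_tokens = sum(token_counts.values())
    total_langs = len(token_counts)
    langs = defaultdict(int)
    tokens = defaultdict(int)
    for lang, count in token_counts.items():
        tier = tier_of[lang]
        langs[tier] += 1
        tokens[tier] += count
    return {
        tier: (langs[tier] / total_langs, tokens[tier] / total_tokens)
        for tier in sorted(langs)
    }


def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending values.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Toy data (assumption): placeholder languages, not taken from the paper.
token_counts = {"lang_a": 750, "lang_b": 150, "lang_c": 60, "lang_d": 30, "lang_e": 10}
tier_of = {"lang_a": 3, "lang_b": 2, "lang_c": 1, "lang_d": 0, "lang_e": 0}

shares = tier_shares(token_counts, tier_of)
for tier, (lang_share, token_share) in shares.items():
    print(f"Tier {tier}: {lang_share:.1%} of languages, {token_share:.1%} of tokens")
print(f"Gini over all languages: {gini(token_counts.values()):.2f}")
```

In this toy setup a single language in the top tier holds most of the tokens, which mirrors the shape of the reported imbalance without reproducing its numbers.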
Abstract
The world's 7,000+ languages vary widely in the availability of resources for NLP, motivating efforts to systematically categorize them by their degree of resourcefulness (Joshi et al., 2020). A similar disparity exists among programming languages (PLs); however, no resource-tier taxonomy has been established for code. As large language models (LLMs) grow increasingly capable of generating code, such a taxonomy becomes essential. To fill this gap, we present the first reproducible PL resource classification, grouping 646 languages into four tiers. We show that only 1.9% of languages (Tier 3, High) account for 74.6% of all tokens in seven major corpora, while 71.7% of languages (Tier 0, Scarce) contribute just 1.0%. Statistical analyses of within-tier inequality, dispersion, and distributional skew confirm that this imbalance is both extreme and systematic. Our results provide a…
