Beyond Language Boundaries: Uncovering Programming Language Families for Code Language Models

Shangbo Yun; Xiaodong Gu; Jianghong Huang; Beijun Shen

arXiv:2512.19509·cs.SE·December 23, 2025

Open Access

TL;DR

This paper uncovers latent programming language families using an embedding-based framework and demonstrates how leveraging these relationships can significantly improve multilingual code language models.

Contribution

The paper introduces an embedding-based approach for identifying programming language families and proposes strategies that use these families to improve the training of multilingual code LLMs.

Findings

01 Hierarchical language relationships are clearly revealed.

02 Related languages form well-defined clusters.

03 Strategies improve multilingual LLM performance on code tasks.

Abstract

The rapid proliferation of diverse programming languages presents both opportunities and challenges for developing multilingual code LLMs. While existing techniques often train code LLMs by simply aggregating multilingual code data, few explore the deeper relationships between programming languages (PLs) and how such relationships can be used to optimize the training and inference of code LLMs. In this work, we investigate two fundamental questions: (1) What are the deep linguistic relationships among PLs? and (2) How can these relationships be leveraged to improve multilingual code LLMs? We propose an embedding-based framework to uncover the latent families of PLs. Our approach begins by defining 21 primary linguistic features of programming languages, such as variable definition, control structures, and method declarations, and then employs LLMs to generate feature-aligned code samples…
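The abstract is cut off before it describes how embeddings are computed or grouped, so the following is only a minimal sketch of one way per-feature embeddings could be turned into language families. It assumes each language already has one vector per linguistic feature, averages them into a single language vector, and groups languages with average-linkage agglomerative clustering on cosine similarity. The function names, the similarity threshold, the placeholder language names, and the toy 2-D vectors are all hypothetical illustrations, not values or methods taken from the paper.

```python
import math
from itertools import combinations


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def language_embeddings(feature_embeddings):
    """Collapse {language: {feature: vector}} into {language: vector}.

    Averaging over features is an assumption made for this sketch.
    """
    return {
        lang: mean_vector(list(features.values()))
        for lang, features in feature_embeddings.items()
    }


def cluster_languages(lang_vecs, threshold):
    """Average-linkage agglomerative clustering on cosine similarity.

    Repeatedly merges the two most similar clusters and stops once the
    best remaining similarity falls below `threshold`.
    """
    clusters = [[lang] for lang in sorted(lang_vecs)]

    def linkage(c1, c2):
        sims = [cosine_similarity(lang_vecs[a], lang_vecs[b])
                for a in c1 for b in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


# Made-up toy vectors for placeholder languages; not data from the paper.
# The two feature names are taken from the examples listed in the abstract.
TOY_FEATURE_EMBEDDINGS = {
    "lang_a": {"variable_definition": [1.0, 0.0], "control_structures": [0.9, 0.1]},
    "lang_b": {"variable_definition": [0.9, 0.1], "control_structures": [1.0, 0.0]},
    "lang_c": {"variable_definition": [0.0, 1.0], "control_structures": [0.1, 0.9]},
    "lang_d": {"variable_definition": [0.1, 0.9], "control_structures": [0.0, 1.0]},
}

families = cluster_languages(language_embeddings(TOY_FEATURE_EMBEDDINGS), threshold=0.8)
print(families)
```

Recording the order in which clusters merge, instead of stopping at a fixed threshold, would give a hierarchy rather than a flat grouping; a real pipeline would also replace the toy vectors with embeddings of the feature-aligned code samples.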

Taxonomy

Topics: Software Engineering Research · Natural Language Processing Techniques · Topic Modeling