How Programming Concepts and Neurons Are Shared in Code Language Models
Amir Hossein Kargaran, Yihong Liu, François Yvon, Hinrich Schütze

TL;DR
This paper investigates how large language models (LLMs) internally represent multiple programming languages (PLs) and English. It finds that the models' concept space is closer to English (including PL keywords) and that the location of language-specific neurons depends on the layer.
Contribution
The paper analyzes the concept space shared by multiple programming languages and English in LLMs, focusing on layer-specific neuron activations and on how closely PLs align with one another.
Findings
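The concept space is closer to English (including PL keywords), and English tokens receive high probabilities in the second half of the intermediate layers.
Language-specific neurons are concentrated mainly in the bottom layers, while neurons exclusive to a single PL tend to appear in the top layers.
PLs that are highly aligned with multiple other PLs have larger keyword sets and are closer to the model's concept space.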
Abstract
Several studies have explored the mechanisms of large language models (LLMs) in coding tasks, but most have focused on programming languages (PLs) in a monolingual setting. In this paper, we investigate the relationship between multiple PLs and English in the concept space of LLMs. We perform a few-shot translation task on 21 PL pairs using two Llama-based models. By decoding the embeddings of intermediate layers during this task, we observe that the concept space is closer to English (including PL keywords) and assigns high probabilities to English tokens in the second half of the intermediate layers. We analyze neuron activations for 11 PLs and English, finding that while language-specific neurons are primarily concentrated in the bottom layers, those exclusive to each PL tend to appear in the top layers. For PLs that are highly aligned with multiple other PLs, identifying…
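The idea of decoding intermediate-layer embeddings can be illustrated with a minimal sketch. This is not the authors' code: it assumes a logit-lens style readout, in which each layer's hidden state at the last token position is normalized and projected through the unembedding matrix. The arrays are random toy data, and `english_ids` is a hypothetical set of token ids standing in for "English tokens (including PL keywords)". The RMS normalization here omits the learned gain used in Llama-style models.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rms_norm(h, eps=1e-6):
    # Simplified RMS normalization (no learned gain; an assumption of this sketch).
    return h / np.sqrt(np.mean(h ** 2, axis=-1, keepdims=True) + eps)

def logit_lens(hidden_states, unembed):
    # hidden_states: (n_layers, d_model), one vector per layer for one token position.
    # unembed: (d_model, vocab_size).
    # Returns a (n_layers, vocab_size) array of next-token probabilities per layer.
    return softmax(rms_norm(hidden_states) @ unembed)

def token_set_mass(probs, token_ids):
    # Total probability each layer assigns to a chosen set of token ids.
    return probs[:, token_ids].sum(axis=-1)

# Toy data standing in for a real model (all values are hypothetical).
rng = np.random.default_rng(0)
n_layers, d_model, vocab_size = 8, 16, 50
hidden = rng.normal(size=(n_layers, d_model))
W_U = rng.normal(size=(d_model, vocab_size))
english_ids = np.arange(20)  # hypothetical ids of English tokens and PL keywords

probs = logit_lens(hidden, W_U)
mass = token_set_mass(probs, english_ids)
for layer, m in enumerate(mass):
    print(f"layer {layer}: probability mass on English token set = {m:.3f}")
```

With a real model, `hidden` would hold the per-layer hidden states collected during the few-shot translation prompt, and plotting `mass` against the layer index would show where English tokens start to dominate.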
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Advanced Software Engineering Methodologies
Methods: Sparse Evolutionary Training
