The Rise and Down of Babel Tower: Investigating the Evolution Process of Multilingual Code Large Language Model
Jiawei Chen, Wentao Chen, Jing Su, Jingjing Xu, Hongyu Lin, Mengjie Ren, Yaojie Lu, Xianpei Han, Le Sun

TL;DR
This paper investigates how multilingual capabilities develop in large language models during pre-training, proposing the Babel Tower Hypothesis and a new corpus construction method that enhances multilingual performance.
Contribution
It introduces the Babel Tower Hypothesis to explain multilingual capability evolution and proposes a novel pre-training corpus construction method that improves multilingual LLM performance.
Findings
Internal state changes align with the Babel Tower Hypothesis
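Training on a corpus built with the proposed construction method outperforms training on the original corpus
Multilingual capabilities evolve from a single shared knowledge system, dominated by the primary language, toward language-specific knowledge systems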
Abstract
Large language models (LLMs) have shown significant multilingual capabilities. However, the mechanisms underlying the development of these capabilities during pre-training are not well understood. In this paper, we use code LLMs as an experimental platform to explore the evolution of multilingual capabilities in LLMs during the pre-training process. Based on our observations, we propose the Babel Tower Hypothesis, which describes the entire process of LLMs acquiring new language capabilities. During the learning process, multiple languages initially share a single knowledge system dominated by the primary language and gradually develop language-specific knowledge systems. We then validate the above hypothesis by tracking the internal states of the LLMs through identifying working languages and language transferring neurons. Experimental results show that the internal state changes of…
Taxonomy
Topics: Natural Language Processing Techniques
