Can Programming Languages Boost Each Other via Instruction Tuning?
Daoguang Zan, Ailun Yu, Bo Shen, Jiaxin Zhang, Taihong Chen, Bing Geng, Bei Chen, Jichuan Ji, Yafen Yao, Yongji Wang, Qianxiang Wang

TL;DR
This paper investigates whether instruction fine-tuning a code large language model on one programming language improves its code generation in other languages, and reports substantial cross-language gains on StarCoder.
Contribution
An experimental study across 8 programming languages showing that instruction-tuning StarCoder on a single language can improve its code generation performance on the others.
Findings
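CodeM-Python 15B, trained on Python, improves Java pass@1 on HumanEval-X by an absolute 17.95%
CodeM-HTML 7B, trained on the HTML corpus, improves Java pass@1 by an absolute 15.24%
Instruction tuning on one programming language can significantly boost code generation in others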
Abstract
Once human programmers have mastered one programming language, learning a new one becomes easier. In this report, we explore whether programming languages can boost each other during the instruction fine-tuning phase of code large language models. We conduct extensive experiments on 8 popular programming languages (Python, JavaScript, TypeScript, C, C++, Java, Go, HTML) with StarCoder. Results demonstrate that programming languages can significantly improve each other. For example, CodeM-Python 15B, trained on Python, increases Java pass@1 on HumanEval-X by an absolute 17.95%. More surprisingly, we found that CodeM-HTML 7B, trained on the HTML corpus, improves Java pass@1 by an absolute 15.24%. Our training data is released at https://github.com/NL2Code/CodeM.
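The pass@1 numbers in the abstract are easier to interpret with the metric written out. The sketch below assumes the standard unbiased pass@k estimator commonly used with HumanEval-style benchmarks; the abstract does not describe the exact evaluation protocol, so this is an illustration and not the authors' evaluation script. The function names and the sample counts in the demo are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated, c: number that passed the tests,
    k: sample budget. Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


def absolute_gain_points(baseline: float, tuned: float) -> float:
    """Absolute improvement in percentage points (not a relative change)."""
    return 100.0 * (tuned - baseline)


if __name__ == "__main__":
    # Hypothetical (n, c) counts for three problems, NOT data from the paper.
    baseline_results = [(10, 2), (10, 0), (10, 5)]
    tuned_results = [(10, 4), (10, 3), (10, 6)]
    base = benchmark_pass_at_k(baseline_results, k=1)
    tuned = benchmark_pass_at_k(tuned_results, k=1)
    print(f"baseline pass@1: {base:.4f}")
    print(f"tuned pass@1:    {tuned:.4f}")
    print(f"absolute gain:   {absolute_gain_points(base, tuned):.2f} points")
```

With k = 1 the estimator reduces to the fraction of samples that pass, so an "absolute 17.95%" improvement means the tuned model's pass@1 is 17.95 percentage points higher than the baseline's, not 17.95% higher in relative terms.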
Taxonomy
Topics: Software Engineering Research · Computational Physics and Python Applications · Scientific Computing and Data Management
