CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation
Weixiang Yan, Yuchen Tian, Yunzhe Li, Qian Chen, Wen Wang

TL;DR
CodeTransOcean introduces the largest multilingual benchmark for code translation, including diverse datasets and evaluation metrics, to advance research in translating code across multiple languages and frameworks.
Contribution
The paper presents a comprehensive multilingual benchmark with novel datasets, evaluation metrics, and modeling approaches for diverse code translation tasks.
Findings
Multilingual models improve translation quality for low-resource languages.
The benchmark enables evaluation of code translation across frameworks and niche languages.
LLMs like ChatGPT show potential but face challenges in code translation accuracy.
Abstract
Recent code translation techniques exploit neural machine translation models to translate source code from one programming language to another to satisfy production compatibility or to improve efficiency of codebase maintenance. Most existing code translation datasets only focus on a single pair of popular programming languages. To advance research on code translation and meet diverse requirements of real-world applications, we construct CodeTransOcean, a large-scale comprehensive benchmark that supports the largest variety of programming languages for code translation. CodeTransOcean consists of three novel multilingual datasets, namely, MultilingualTrans supporting translations between multiple popular programming languages, NicheTrans for translating between niche programming languages and popular ones, and LLMTrans for evaluating executability of translated code by large language…
Taxonomy
Topics: Software Engineering Research · Machine Learning and Data Classification · Topic Modeling
