A Preliminary Study of Multilingual Code Language Models for Code Generation Task Using Translated Benchmarks
Rohit Dandamudi, Gema Rodríguez-Pérez

TL;DR
This study evaluates multilingual code language models using translated benchmarks, revealing alignment with training metrics but also inconsistencies and reproducibility challenges, emphasizing the need for further empirical validation.
Contribution
It provides the first empirical assessment of translated benchmarks for multilingual code generation models, highlighting their potential and limitations.
Findings
Results on the translated benchmarks align with metrics used during training, such as perplexity.
Results are inconsistent across the different translated benchmarks.
Reproducing the reported performance proved difficult.
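For readers unfamiliar with the perplexity metric mentioned in the findings, the following is a minimal sketch of the standard definition: the exponential of the average negative log-likelihood per token. The page does not say how the authors computed it, so the function name `perplexity` and the sample log-probabilities are illustrative assumptions, not the study's code or data.

```python
import math


def perplexity(token_logprobs):
    """Return perplexity given per-token natural-log probabilities.

    Standard definition: exp of the mean negative log-likelihood.
    (Hypothetical helper, not taken from the paper.)
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# Made-up example: a model that assigns probability 0.25 to every token
# is, on average, as uncertain as a uniform choice among four tokens.
sample_logprobs = [math.log(0.25)] * 4
print(perplexity(sample_logprobs))
```

Lower perplexity on a language's held-out code means the model predicts that code more confidently, which is why it is a natural point of comparison for per-language benchmark results.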
Abstract
Evaluating the performance of Code Language Models (CLMs) for software engineering tasks, especially in multilingual and low-resource programming language settings, poses significant challenges. These challenges stem primarily from the lack of high-quality benchmarks across programming languages and from the imbalanced nature of CLMs' training corpora. Although recent advances in code generation, one of the common downstream tasks, have shown promise by introducing translated benchmarks built with different methodologies, there is currently a lack of empirical evidence assessing these benchmarks. To address this gap, we conducted a preliminary study to evaluate the performance of PolyCoder, a pioneering open-source, multilingual CLM built for code generation. We utilized two existing state-of-the-art translations of the popular code generation benchmark, HumanEval, facilitated by the…
Taxonomy
Methods: ALIGN
