Evaluating Non-English Developer Support in Machine Learning for Software Engineering

Jonathan Katzy; Yongcheng Huang; Gopal-Raj Panchu; Maksym Ziemlewski; Paris Loizides; Sander Vermeulen; Arie van Deursen; Maliheh Izadi

arXiv:2605.05902·cs.SE·May 8, 2026

TL;DR

This study assesses the performance of large language models in generating and evaluating non-English code comments, revealing significant challenges and the current limitations of automatic metrics and models in multilingual contexts.

Contribution

It provides a comprehensive evaluation of non-English code comment generation and introduces a human-annotated dataset with a taxonomy of error types for multilingual code evaluation.

Findings

01

Performance drops significantly outside English, with linguistic errors increasing by a factor of up to 15.1.

02

Automatic metrics and LLM-based judges struggle to reliably evaluate non-English comments.

03

Human judgment remains essential for accurate assessment of multilingual code comments.

Abstract

Large Language Models are increasingly used in software engineering, but both code generation and its evaluation remain predominantly English-centric. This leaves a major gap in our understanding of how well current tools support multilingual development, where code contains non-English natural language. In this paper, we investigate non-English code comment generation and the reliability of current methods for evaluating such outputs. We evaluate five code LLMs (CodeGemma, CodeLlama, CodeQwen1.5, GraniteCode, and StarCoder2) across five natural languages: Dutch, English, Greek, Polish, and Chinese. We further conduct an open-coding study of 12,500 generated comments, from which we derive a publicly released human-annotated dataset and a taxonomy of 26 error types. We use these human annotations to evaluate the performance of neural metrics and LLM-as-a-judge pipelines. Our findings…
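As a rough illustration of how human annotations can serve as a reference for an LLM-as-a-judge pipeline, the sketch below computes Cohen's kappa between human labels and judge labels. The abstract does not state which agreement statistic the paper uses, so the choice of kappa, the function name `cohen_kappa`, the label names, and the toy data are all assumptions made for this example rather than the authors' protocol.

```python
from collections import Counter


def cohen_kappa(human, judge):
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    # Fraction of items on which the two raters assign the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels (not taken from the paper's dataset).
human_labels = ["ok", "ok", "error", "error", "ok", "error", "ok", "ok"]
judge_labels = ["ok", "ok", "error", "ok", "ok", "ok", "ok", "error"]

print(f"kappa = {cohen_kappa(human_labels, judge_labels):.3f}")
```

Computing such a score separately for each natural language would show whether a judge that tracks human labels well on English comments does so less well on Dutch, Greek, Polish, or Chinese ones.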
