A Qualitative Investigation into LLM-Generated Multilingual Code Comments and Automatic Evaluation Metrics

Jonathan Katzy; Yongcheng Huang; Gopal-Raj Panchu; Maksym Ziemlewski; Paris Loizides; Sander Vermeulen; Arie van Deursen; Maliheh Izadi

arXiv:2505.15469 · cs.SE · May 22, 2025

Open Access

TL;DR

This study evaluates the multilingual capabilities of large language models in generating code comments, revealing significant challenges in accuracy and the unreliability of current evaluation metrics across diverse languages.

Contribution

The paper provides an error taxonomy for multilingual code comments, assesses how reliably standard evaluation metrics capture comment correctness, and releases a dataset of 12,500 labeled generations for future research.

Findings

01

Models often produce partially correct comments across languages.

02

Standard metrics fail to reliably distinguish correct from incorrect comments.

03

There is a significant overlap in metric scores between correct and incorrect comments.
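One way to make the overlap in finding 03 concrete is to ask how often a correct comment outscores an incorrect one under a given metric. The sketch below is not the paper's procedure; it is a minimal illustration, assuming each comment has a single numeric metric score and a correct/incorrect label. The function name `pairwise_auc` and the example scores are hypothetical and made up for illustration only.

```python
from itertools import product


def pairwise_auc(correct_scores, incorrect_scores):
    """Probability that a randomly chosen correct comment scores higher
    than a randomly chosen incorrect one (ties count as half a win).

    1.0 means the metric separates the two groups perfectly;
    0.5 means the score distributions overlap so much that the
    metric is no better than chance.
    """
    if not correct_scores or not incorrect_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for c, i in product(correct_scores, incorrect_scores):
        if c > i:
            wins += 1.0
        elif c == i:
            wins += 0.5
    return wins / (len(correct_scores) * len(incorrect_scores))


if __name__ == "__main__":
    # Made-up scores purely for illustration; NOT taken from the paper.
    correct = [0.2, 0.5, 0.7, 0.9]
    incorrect = [0.1, 0.4, 0.6, 0.8]
    print(pairwise_auc(correct, incorrect))  # 0.625
```

A value close to 0.5 would correspond to the situation the findings describe: the metric's scores for correct and incorrect comments are heavily interleaved, so a threshold on the score cannot separate them.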

Abstract

Large Language Models are essential coding assistants, yet their training is predominantly English-centric. In this study, we evaluate the performance of code language models in non-English contexts, identifying challenges in their adoption and integration into multilingual workflows. We conduct an open-coding study to analyze errors in code comments generated by five state-of-the-art code models, CodeGemma, CodeLlama, CodeQwen1.5, GraniteCode, and StarCoder2, across five natural languages: Chinese, Dutch, English, Greek, and Polish. Our study yields a dataset of 12,500 labeled generations, which we publicly release. We then assess the reliability of standard metrics in capturing comment correctness across languages and evaluate their trustworthiness as judgment criteria. Through our open-coding investigation, we identified a taxonomy of 26 distinct error categories in…

Taxonomy

Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing