Hallucinations in Code Change to Natural Language Generation: Prevalence and Evaluation of Detection Metrics

Chunhua Liu; Hong Yi Lin; Patanamon Thongtanunam

arXiv:2508.08661·cs.SE·August 13, 2025

Open Access

TL;DR

This paper investigates the prevalence of hallucinations in code change to natural language tasks, such as commit message and code review comment generation, and evaluates metric-based detection methods, highlighting their strengths and limitations.

Contribution

The paper provides the first comprehensive analysis of hallucinations in code change to natural language tasks and assesses the effectiveness of various detection metrics, particularly when several are combined.

Findings

01. Approximately 50% of generated code reviews contain hallucinations.

02. About 20% of generated commit messages contain hallucinations.

03. Combining multiple metrics improves detection performance.
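
This page does not say how the metrics are combined, so the following is only a minimal sketch of one generic way to merge several per-sample scores into a single detection signal: standardize each metric, average the results, and flag samples above a threshold. The metric names (`metric_a`, `metric_b`), the toy scores, the threshold, and the assumption that a higher score means "more likely hallucinated" are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize a list of scores to mean 0 and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant metric carries no signal, so it contributes zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def combine_metrics(metric_scores):
    """Average z-scored metrics into one combined score per sample.

    metric_scores maps a metric name to a list of per-sample scores.
    Assumption: for every metric, higher means "more likely hallucinated".
    """
    standardized = [zscore(scores) for scores in metric_scores.values()]
    # zip(*...) regroups the scores so each tuple holds one sample's metrics.
    return [mean(sample) for sample in zip(*standardized)]


def flag_hallucinations(metric_scores, threshold=0.0):
    """Flag samples whose combined score exceeds a (hypothetical) threshold."""
    return [score > threshold for score in combine_metrics(metric_scores)]


# Hypothetical toy scores for four generated outputs.
toy_scores = {
    "metric_a": [0.1, 0.9, 0.2, 0.8],
    "metric_b": [0.3, 0.7, 0.1, 0.9],
}
flags = flag_hallucinations(toy_scores)
```

Standardizing first keeps a metric with a wide numeric range from dominating the average. A learned combiner (for example, a classifier trained on labeled outputs) would be another option, but it needs annotated data that this sketch does not assume.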

Abstract

Language models have shown strong capabilities across a wide range of tasks in software engineering, such as code generation, yet they suffer from hallucinations. While hallucinations have been studied independently in natural language and code generation, their occurrence in tasks involving code changes, which have a structurally complex and context-dependent format, remains largely unexplored. This paper presents the first comprehensive analysis of hallucinations in two critical tasks involving code change to natural language generation: commit message generation and code review comment generation. We quantify the prevalence of hallucinations in recent language models and explore a range of metric-based approaches to automatically detect them. Our findings reveal that approximately 50% of generated code reviews and 20% of generated commit messages contain hallucinations.…

Taxonomy

Topics: Software Engineering Research · Adversarial Robustness in Machine Learning · Security and Verification in Computing