Hallucinations in Code Change to Natural Language Generation: Prevalence and Evaluation of Detection Metrics
Chunhua Liu, Hong Yi Lin, Patanamon Thongtanunam

TL;DR
This paper investigates the prevalence of hallucinations in code change to natural language tasks, such as commit message and code review comment generation, and evaluates metric-based detection methods, highlighting their strengths and limitations.
Contribution
It provides the first comprehensive analysis of hallucinations in code change tasks and assesses the effectiveness of various detection metrics, especially combined approaches.
Findings
Approximately 50% of generated code reviews contain hallucinations.
About 20% of generated commit messages contain hallucinations.
Combining multiple metrics improves detection performance.
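To make the last finding concrete, here is a minimal sketch of one way several metric scores could be merged into a single detection score and then evaluated. It is an illustration only, not the authors' method: the equal-weight z-score average, the convention that a higher score means "more likely hallucinated", the function names, and the toy scores are all assumptions introduced here.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize one metric so metrics on different scales can be averaged."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def combine(metric_columns):
    """Average the standardized metrics per example (equal weights, an assumption)."""
    standardized = [zscore(col) for col in metric_columns]
    return [mean(row) for row in zip(*standardized)]


def roc_auc(scores, labels):
    """Chance that a hallucinated example outscores a clean one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up scores that only exercise the code; they are not data from the paper.
labels = [1, 1, 1, 0, 0, 0]              # 1 = hallucinated, 0 = clean
metric_a = [0.9, 0.2, 0.8, 0.3, 0.1, 0.4]  # hypothetical metric, higher = more suspicious
metric_b = [0.3, 0.9, 0.7, 0.2, 0.4, 0.1]  # a second hypothetical metric
combined = combine([metric_a, metric_b])

for name, scores in [("a", metric_a), ("b", metric_b), ("combined", combined)]:
    print(name, round(roc_auc(scores, labels), 3))
```

In this constructed example each metric misses a different hallucinated item, so the averaged score separates the classes better than either metric alone. Real metrics may be correlated or point in opposite directions, in which case a learned weighting would be a more sensible choice than a plain average.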
Abstract
Language models have shown strong capabilities across a wide range of tasks in software engineering, such as code generation, yet they suffer from hallucinations. While hallucinations have been studied independently in natural language and code generation, their occurrence in tasks involving code changes, which have a structurally complex and context-dependent format of code, remains largely unexplored. This paper presents the first comprehensive analysis of hallucinations in two critical tasks involving code change to natural language generation: commit message generation and code review comment generation. We quantify the prevalence of hallucinations in recent language models and explore a range of metric-based approaches to automatically detect them. Our findings reveal that approximately 50% of generated code reviews and 20% of generated commit messages contain hallucinations.…
Taxonomy
Topics: Software Engineering Research · Adversarial Robustness in Machine Learning · Security and Verification in Computing
