CodeFuse-CommitEval: Towards Benchmarking LLM's Power on Commit Message and Code Change Inconsistency Detection
Qingyu Zhang, Puzhuo Liu, Peng Di, Chenxiong Qian

TL;DR
This paper introduces CODEFUSE-COMMITEVAL, a benchmark dataset for evaluating large language models' ability to detect inconsistencies between commit messages and code changes, addressing a critical gap in version control quality assurance.
Contribution
It presents the first dedicated benchmark for message-code inconsistency detection, including a diverse dataset, multiple inconsistency types, and comprehensive evaluation of state-of-the-art LLMs with various prompting strategies.
Findings
Models identify inconsistent commits more reliably than they recognize consistent ones.
GPT-OSS-20B performs best but uses more tokens.
Augmentation strategies have mixed effects on detection accuracy.
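The first finding compares detection quality on the two classes separately. A minimal sketch of how such a per-class comparison could be computed is shown below; the function name `per_class_recall`, the label strings, and the toy data are all assumptions made for illustration and are not results or code from the paper.

```python
from collections import Counter


def per_class_recall(labels, predictions):
    """Return the fraction of samples in each true class that were predicted correctly."""
    totals = Counter(labels)
    hits = Counter(l for l, p in zip(labels, predictions) if l == p)
    return {cls: hits[cls] / totals[cls] for cls in totals}


# Toy, made-up data (not from the paper): the model flags every commit
# except the last one as inconsistent.
labels = ["inconsistent", "inconsistent", "consistent", "consistent"]
predictions = ["inconsistent", "inconsistent", "inconsistent", "consistent"]

recall = per_class_recall(labels, predictions)
print(recall)
```

Reporting recall per class, rather than a single accuracy number, makes a gap between the inconsistent and consistent classes visible even when the dataset is balanced.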
Abstract
Version control relies on commit messages to convey the rationale for code changes, but these messages are often low quality and, more critically, inconsistent with their diffs, a problem known as message-code inconsistency (MCI). MCIs mislead reviewers, hinder maintenance, contaminate research datasets, and may obscure security patches. Yet no dedicated benchmark exists to evaluate models for MCI detection. We introduce CODEFUSE-COMMITEVAL, the first benchmark designed for MCI detection using large language models (LLMs). Building on the ApacheCM dataset for diversity and quality, we generate seven types of inconsistent messages through rule-guided mutations of originally consistent commits and apply two-fold validation to verify both positive and negative samples. Using this labeled dataset of message-diff pairs, we evaluate six state-of-the-art open-source LLMs under a vanilla setting and with…
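The abstract describes producing inconsistent messages by applying rule-guided mutations to commits that were originally consistent. The sketch below illustrates the general idea with a single, hypothetical rule that swaps an action verb for its opposite. The rule, the `OPPOSITES` table, and the function name `mutate_message` are assumptions for illustration only; the summary above does not name the seven mutation types, and this is not the authors' implementation.

```python
import re

# Hypothetical rule table (an assumption, not taken from the paper):
# replace an action verb with its opposite so the message contradicts the diff.
OPPOSITES = {
    "add": "remove",
    "remove": "add",
    "enable": "disable",
    "disable": "enable",
}

_VERB_PATTERN = re.compile(r"\b(" + "|".join(OPPOSITES) + r")\b", re.IGNORECASE)


def mutate_message(message):
    """Swap the first matching verb; return (new_message, whether_a_rule_fired)."""

    def swap(match):
        word = match.group(0)
        replacement = OPPOSITES[word.lower()]
        # Preserve the capitalization of the original word.
        return replacement.capitalize() if word[0].isupper() else replacement

    mutated, count = _VERB_PATTERN.subn(swap, message, count=1)
    return mutated, count > 0


sample = mutate_message("Add null check to config loader")
print(sample)
```

Returning a flag alongside the mutated text lets a pipeline discard commits where no rule applies, so every sample labeled inconsistent has actually been changed. A separate validation step, as the abstract mentions, would still be needed to confirm that the mutated message really contradicts the diff.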
Taxonomy
Topics: Software Engineering Research · Scientific Computing and Data Management · Web Application Security Vulnerabilities
