CodeFuse-CommitEval: Towards Benchmarking LLM's Power on Commit Message and Code Change Inconsistency Detection

Qingyu Zhang; Puzhuo Liu; Peng Di; Chenxiong Qian

arXiv:2511.19875·cs.SE·November 26, 2025

TL;DR

This paper introduces CODEFUSE-COMMITEVAL, a benchmark dataset for evaluating large language models' ability to detect inconsistencies between commit messages and code changes, addressing a critical gap in version control quality assurance.

Contribution

It presents the first dedicated benchmark for message-code inconsistency detection, including a diverse dataset, multiple inconsistency types, and comprehensive evaluation of state-of-the-art LLMs with various prompting strategies.

Findings

01 Models identify inconsistent commits more reliably than consistent ones.

02 GPT-OSS-20B performs best but uses more tokens.

03 Augmentation strategies have mixed effects on detection accuracy.
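
The first finding compares how well models do on each class separately. One way to make that comparison concrete is per-class recall, sketched below. The function name, the boolean encoding (True meaning "inconsistent"), and the toy labels are assumptions made for illustration; they are not the paper's metric definitions or its results.

```python
from typing import List, Tuple


def per_class_recall(labels: List[bool], preds: List[bool]) -> Tuple[float, float]:
    """Return (recall on inconsistent samples, recall on consistent samples).

    In both lists, True means "inconsistent" (assumed encoding).
    """
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    # Predictions grouped by the ground-truth class of each sample.
    on_inconsistent = [p for l, p in zip(labels, preds) if l]
    on_consistent = [p for l, p in zip(labels, preds) if not l]
    # Fraction of truly inconsistent samples that were flagged.
    inc_recall = sum(on_inconsistent) / len(on_inconsistent) if on_inconsistent else 0.0
    # Fraction of truly consistent samples that were left unflagged.
    con_recall = sum(not p for p in on_consistent) / len(on_consistent) if on_consistent else 0.0
    return inc_recall, con_recall


# Toy data only: three inconsistent and three consistent samples.
labels = [True, True, True, False, False, False]
preds = [True, True, True, True, False, False]
inconsistent_recall, consistent_recall = per_class_recall(labels, preds)
```

A model that flags commits too eagerly would show the pattern in this toy data: high recall on inconsistent samples and lower recall on consistent ones.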

Abstract

Version control relies on commit messages to convey the rationale for code changes, but these messages are often low quality and, more critically, inconsistent with their diffs, a problem known as message-code inconsistency (MCI). MCIs mislead reviewers, hinder maintenance, contaminate research datasets, and may obscure security patches. Yet no dedicated benchmark exists to evaluate models for MCI detection. We introduce CODEFUSE-COMMITEVAL, the first benchmark designed for MCI detection using large language models (LLMs). Building on the ApacheCM dataset for diversity and quality, we generate seven types of inconsistent messages through rule-guided mutations of originally consistent commits and apply two-fold validation to verify both positive and negative samples. Using this labeled dataset of message-diff pairs, we evaluate six state-of-the-art open-source LLMs under a vanilla setting and with…
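
The abstract describes generating inconsistent messages by applying rule-guided mutations to consistent commits. The sketch below shows what one such rule could look like. It is a hypothetical example: the verb-flipping rule, the `Sample` type, and the `OPPOSITE_VERBS` table are assumptions for illustration, and this listing does not say whether a rule like this is among the paper's seven types.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical verb pairs chosen for illustration; not taken from the paper.
OPPOSITE_VERBS = {
    "add": "remove",
    "remove": "add",
    "enable": "disable",
    "disable": "enable",
}


@dataclass
class Sample:
    message: str        # commit message
    diff: str           # code change, left untouched by the mutation
    inconsistent: bool  # label: True if the message contradicts the diff


def mutate_action_verb(sample: Sample) -> Optional[Sample]:
    """Flip the leading verb of a consistent message.

    Returns None when the rule does not apply, so callers can skip the sample.
    """
    if sample.inconsistent:
        return None
    head, _, rest = sample.message.partition(" ")
    swapped = OPPOSITE_VERBS.get(head.lower())
    if swapped is None:
        return None
    # Preserve the capitalization of the original first word.
    if head[0].isupper():
        swapped = swapped.capitalize()
    new_message = f"{swapped} {rest}" if rest else swapped
    return Sample(message=new_message, diff=sample.diff, inconsistent=True)


original = Sample("Add null check to parser", "+ if (node == null) return;", False)
mutated = mutate_action_verb(original)
skipped = mutate_action_verb(Sample("Refactor parser", "", False))
```

Here the diff stays the same while the message now claims the opposite action, which yields a labeled inconsistent pair. A rule that returns None when it does not match leaves room for a separate validation pass, such as the two-fold validation the abstract mentions.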

Taxonomy

Topics: Software Engineering Research · Scientific Computing and Data Management · Web Application Security Vulnerabilities