MUCOCO: Automated Consistency Testing of Code LLMs

Chua Jin Chou; Khant That Lwin; Ezekiel Soremekun

arXiv:2604.19086 · cs.SE · April 22, 2026

TL;DR

MUCOCO is an automated method that uses semantic-preserving mutations to detect inconsistent behaviors in code LLMs, addressing a gap left by hand-crafted, static benchmarks that do not target consistency.

Contribution

We introduce MUCOCO, a novel automated consistency testing approach that uncovers inconsistent program behaviors in code LLMs through mutation analysis.

Findings

01

MUCOCO exposes inconsistencies in about 15% of generated inputs.

02

It outperforms the baseline method TURBULENCE.

03

It is effective across four coding tasks and seven LLMs.

Abstract

Code LLMs often exhibit inconsistent program behaviors. Developers typically employ benchmarks to assess Code LLMs, but most benchmarks are hand-crafted and static, and do not target the consistency property. In this work, we pose the scientific question: how can we automatically discover inconsistent program behaviors in Code LLMs? To address this challenge, we propose an automated consistency testing method, called MUCOCO, which employs semantic-preserving mutation analysis to expose inconsistent behaviors in Code LLMs. Given a coding query, MUCOCO automatically transforms its program into semantically equivalent programs (aka mutants) and detects inconsistencies between the mutants and the original program (e.g., different output or test failure). We evaluate MUCOCO using four (4) coding tasks and seven (7) LLMs. Results show that MUCOCO is effective in exposing inconsistency and outperforms…
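The mutate-then-compare idea described in the abstract can be illustrated with a minimal Python sketch. This is not MUCOCO's implementation: the variable-renaming mutation, the helper names (`rename_mutant`, `find_inconsistencies`), and both stand-in "models" are assumptions made for illustration. A real setup would replace the stubs with calls to a Code LLM. The rename is only semantics-preserving if the new name does not collide with an existing identifier.

```python
import ast
from typing import Callable, List


class RenameVariable(ast.NodeTransformer):
    """Rename one identifier wherever it appears as a name or a parameter."""

    def __init__(self, old: str, new: str):
        self.old, self.new = old, new

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        if node.arg == self.old:
            node.arg = self.new
        return node


def rename_mutant(source: str, old: str, new: str) -> str:
    """Build a mutant by renaming `old` to `new` (assumes `new` is unused)."""
    tree = RenameVariable(old, new).visit(ast.parse(source))
    return ast.unparse(tree)  # requires Python 3.9+


def find_inconsistencies(
    model: Callable[[str], object], source: str, mutants: List[str]
) -> List[str]:
    """Return the mutants on which the model's answer differs from the original."""
    baseline = model(source)
    return [m for m in mutants if model(m) != baseline]


ORIGINAL = """
def f(n):
    total = 0
    for i in range(n):
        total += i
    return total
"""

MUTANT = rename_mutant(ORIGINAL, "total", "acc")


def executing_model(source: str) -> object:
    """Hypothetical ideal model: 'predicts' f(4) by actually running the code."""
    env: dict = {}
    exec(source, env)
    return env["f"](4)


def brittle_model(source: str) -> object:
    """Hypothetical flawed model: its answer depends on a surface identifier."""
    return "sum" if "total" in source else "unknown"


consistent = find_inconsistencies(executing_model, ORIGINAL, [MUTANT])
inconsistent = find_inconsistencies(brittle_model, ORIGINAL, [MUTANT])
```

Here `consistent` is empty because executing the original and the mutant gives the same result, while `inconsistent` contains the mutant because the brittle stub changes its answer when only a variable name changes. Comparing answers against the original, rather than against a ground-truth label, is what makes the check applicable without a hand-written oracle.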
