Code Comprehension then Auditing for Unsupervised LLM Evaluation
Bhrij Patel, Souradip Chakraborty, Mengdi Wang, Dinesh Manocha, Amrit Singh Bedi

TL;DR
CoCoA is a two-step unsupervised framework for code correctness evaluation. It first generates an explanation of what the code does, then judges correctness from that explanation, which improves both interpretability and accuracy.
Contribution
The paper proposes a sequential approach that separates code comprehension from correctness evaluation, aiming for more reliable judgments than prior methods that infer program behavior and correctness in a single step.
Findings
Achieves up to 68% higher F1 score compared to baselines.
Increases accuracy by up to 20% across datasets and languages.
Improves interpretability by generating natural-language explanations.
Abstract
Large Language Models (LLMs) for unsupervised code correctness evaluation have recently gained attention because they can judge if code runs as intended without requiring reference implementations or unit tests, which may be unavailable, sparse, or unreliable. However, most prior approaches condition LLM evaluators directly on the full code implementation, forcing the model to jointly infer program behavior and evaluate correctness in a single step. This entanglement leads to misinterpretations of code behavior and unreliable judgments. To mitigate this issue, we introduce CoCoA, an unsupervised Code Comprehension then Auditing framework that first comprehends functionality to generate a natural-language explanation. Then it evaluates task alignment based on this explanation. By sequentially sampling comprehension before evaluation, CoCoA improves the quality of inferred program…
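The comprehend-then-audit flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names (`comprehend`, `audit`, `evaluate`), the `Verdict` type, the YES/NO answer format, and the toy `stub_llm` are all assumptions introduced here. It also assumes the auditing step sees only the task and the explanation, which is one reading of "evaluates task alignment based on this explanation".

```python
from typing import Callable, NamedTuple

# Any text-in, text-out model call; a hypothetical stand-in for a real LLM client.
LLM = Callable[[str], str]


class Verdict(NamedTuple):
    explanation: str
    is_correct: bool


def comprehend(llm: LLM, code: str) -> str:
    """Step 1: describe what the code does, without seeing the task."""
    prompt = (
        "Explain in plain English what the following code does. "
        "Do not judge whether it is correct.\n\n" + code
    )
    return llm(prompt).strip()


def audit(llm: LLM, task: str, explanation: str) -> bool:
    """Step 2: judge task alignment from the explanation (assumed: code is not shown)."""
    prompt = (
        "Task description:\n" + task + "\n\n"
        "Description of a candidate solution's behavior:\n" + explanation + "\n\n"
        "Does the described behavior solve the task? Answer YES or NO."
    )
    return llm(prompt).strip().upper().startswith("YES")


def evaluate(llm: LLM, task: str, code: str) -> Verdict:
    """Run comprehension first, then auditing, as two separate model calls."""
    explanation = comprehend(llm, code)
    return Verdict(explanation, audit(llm, task, explanation))


def stub_llm(prompt: str) -> str:
    """Toy model for demonstration only; replace with a real LLM call."""
    if prompt.startswith("Explain"):
        return "Returns the sum of the two arguments."
    return "YES" if "sum" in prompt and "add" in prompt.lower() else "NO"


if __name__ == "__main__":
    task = "Write a function that adds two integers."
    code = "def f(a, b):\n    return a + b"
    print(evaluate(stub_llm, task, code))
```

Keeping the two calls separate means the explanation is available as a human-readable artifact alongside the verdict, which is the interpretability benefit the summary points to. In practice `stub_llm` would be swapped for a real model call with the same string-to-string signature.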
