Are LLMs Reliable Code Reviewers? Systematic Overcorrection in Requirement Conformance Judgement
Haolin Jin, Huaming Chen

TL;DR
This paper shows that large language models often misjudge whether code complies with natural language requirements, and do so more often under detailed prompts, exposing reliability issues in LLM-based code review systems.
Contribution
The study systematically uncovers LLM failures in code requirement conformance judgment and proposes a fix-guided verification method to improve reliability.
Findings
LLMs frequently misclassify correct code as non-compliant.
More detailed prompts increase misjudgment rates.
The proposed verification filter improves review accuracy.
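The first two findings concern how often a reviewer rejects code that is in fact correct. As a minimal sketch of how such a rate could be measured, assuming each sample carries a ground-truth label (for example from benchmark tests) and a reviewer verdict, one might write the following. The names `Verdict` and `false_rejection_rate` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Verdict:
    """One review outcome. All field names are illustrative assumptions."""
    sample_id: str
    is_actually_correct: bool  # ground truth, e.g. the code passes the benchmark tests
    judged_compliant: bool     # what the reviewer (e.g. an LLM) concluded


def false_rejection_rate(verdicts: Iterable[Verdict]) -> float:
    """Fraction of known-correct samples that the reviewer judged non-compliant."""
    correct = [v for v in verdicts if v.is_actually_correct]
    if not correct:
        raise ValueError("no known-correct samples to evaluate")
    rejected = sum(1 for v in correct if not v.judged_compliant)
    return rejected / len(correct)
```

Computing this rate separately for each prompt style (for instance, a verdict-only prompt versus one that also asks for an explanation and a proposed fix) would make the second finding directly comparable across prompts.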
Abstract
Large language models (LLMs) have become essential tools in software development, widely used for requirements engineering, code generation, and review tasks. Software engineers often rely on LLMs to verify whether a code implementation satisfies the task requirements, thereby ensuring code robustness and accuracy. However, it remains unclear whether LLMs can reliably judge code against the given task descriptions, which usually take the form of natural language specifications. In this paper, we uncover a systematic failure of LLMs in matching code to natural language requirements. Specifically, using widely adopted benchmarks and a unified prompt design, we demonstrate that LLMs frequently misclassify correct code implementations as non-compliant or defective. Surprisingly, we find that more detailed prompt design, particularly designs requiring explanations and proposed corrections, leads to…
Taxonomy
Topics: Software Engineering Research · Software Engineering Techniques and Practices · Model-Driven Software Engineering Techniques
