Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications
Haolin Jin, Huaming Chen

TL;DR
This paper reveals that large language models often fail to accurately verify code against natural language specifications, especially with complex prompts, raising concerns about their reliability in code review tasks.
Contribution
It uncovers systematic failures of LLMs in code verification against requirements and proposes improved prompting strategies to mitigate these issues.
Findings
LLMs frequently misjudge correct code as incorrect or defective.
Complex prompting techniques increase misjudgment rates.
Root causes of misjudgments are analyzed and mitigation strategies are proposed.
Abstract
Large language models (LLMs) have become essential tools in software development, widely used for requirements engineering, code generation and review tasks. Software engineers often rely on LLMs to assess whether system code implementations satisfy task requirements, thereby enhancing code robustness and accuracy. However, it remains unclear whether LLMs can reliably determine whether the code complies fully with the given task descriptions, which are usually natural language specifications. In this paper, we uncover a systematic failure of LLMs in evaluating whether code aligns with natural language requirements. Specifically, with widely used benchmarks, we employ unified prompts to judge code correctness. Our results reveal that LLMs frequently misclassify correct code implementations as either "not satisfying requirements" or containing potential defects. Surprisingly, more complex…
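The misclassification described in the abstract can be quantified as a false rejection rate: the share of known-correct implementations that a judge labels as not meeting the specification. The following is a minimal sketch under stated assumptions, not the authors' evaluation harness. The `Sample` type, the prompt wording in `build_prompt`, and `strict_toy_judge` (a stand-in for a real LLM call) are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Sample:
    spec: str         # natural language task description
    code: str         # candidate implementation
    is_correct: bool  # ground truth label (assumed to come from benchmark tests)


def build_prompt(sample: Sample) -> str:
    # Hypothetical unified prompt; the paper's actual wording is not reproduced here.
    return (
        "Does the code satisfy the requirement? Answer YES or NO.\n"
        f"Requirement:\n{sample.spec}\n"
        f"Code:\n{sample.code}\n"
    )


def false_rejection_rate(
    samples: Sequence[Sample], judge: Callable[[str], bool]
) -> float:
    """Fraction of ground-truth-correct samples that the judge rejects."""
    correct = [s for s in samples if s.is_correct]
    if not correct:
        raise ValueError("no correct samples to evaluate")
    rejected = sum(1 for s in correct if not judge(build_prompt(s)))
    return rejected / len(correct)


def strict_toy_judge(prompt: str) -> bool:
    # Stand-in for an LLM call (assumption): accepts only code that
    # contains explicit error handling, mimicking an overly strict reviewer.
    return "raise" in prompt


SAMPLES = [
    Sample(
        "Return the sum of two integers.",
        "def add(a, b):\n    return a + b",
        True,
    ),
    Sample(
        "Return the absolute value of x.",
        "def absolute(x):\n"
        "    if not isinstance(x, (int, float)):\n"
        "        raise TypeError('x must be a number')\n"
        "    return abs(x)",
        True,
    ),
    Sample(
        "Return the larger of a and b.",
        "def larger(a, b):\n    return a",
        False,
    ),
]

rate = false_rejection_rate(SAMPLES, strict_toy_judge)
print(f"False rejection rate: {rate:.2f}")
```

In practice the toy judge would be replaced by a function that sends the prompt to a model and parses its verdict. Running the same samples through several prompt variants would then allow their false rejection rates to be compared directly.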
Taxonomy
Topics: Natural Language Processing Techniques · Digital Rights Management and Security · Web Application Security Vulnerabilities
