Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications

Haolin Jin; Huaming Chen

arXiv:2508.12358 · cs.SE · August 19, 2025

Open Access

TL;DR

This paper reveals that large language models often fail to accurately verify code against natural language specifications, especially with complex prompts, raising concerns about their reliability in code review tasks.

Contribution

It uncovers systematic failures of LLMs in code verification against requirements and proposes improved prompting strategies to mitigate these issues.

Findings

01

LLMs frequently misjudge correct code as incorrect or defective.

02

Complex prompting techniques increase misjudgment rates.

03

Root causes of misjudgments are analyzed and mitigation strategies are proposed.

Abstract

Large language models (LLMs) have become essential tools in software development, widely used for requirements engineering, code generation, and code review tasks. Software engineers often rely on LLMs to assess whether system code implementations satisfy task requirements, thereby enhancing code robustness and accuracy. However, it remains unclear whether LLMs can reliably determine whether code fully complies with the given task descriptions, which are usually natural language specifications. In this paper, we uncover a systematic failure of LLMs in evaluating whether code aligns with natural language requirements. Specifically, using widely used benchmarks, we employ unified prompts to judge code correctness. Our results reveal that LLMs frequently misclassify correct code implementations as either "not satisfying requirements" or containing potential defects. Surprisingly, more complex…
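The misclassification the abstract describes can be quantified as the share of known-correct implementations that the model rejects. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code: the `Judgment` record, its field names, and the sample data are all hypothetical, and ground-truth correctness is assumed to come from the benchmark itself (for example, from its reference tests).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One LLM verdict on one implementation (hypothetical record layout)."""
    task_id: str            # illustrative benchmark task identifier
    code_is_correct: bool   # assumed ground truth from the benchmark
    llm_says_correct: bool  # verdict parsed from the model's answer


def misjudgment_rate(judgments):
    """Fraction of ground-truth-correct implementations the LLM rejected."""
    correct = [j for j in judgments if j.code_is_correct]
    if not correct:
        raise ValueError("no ground-truth-correct samples to evaluate")
    rejected = sum(1 for j in correct if not j.llm_says_correct)
    return rejected / len(correct)


# Made-up verdicts purely for illustration; not results from the paper.
sample = [
    Judgment("t1", True, True),
    Judgment("t2", True, False),
    Judgment("t3", True, False),
    Judgment("t4", False, False),
]
rate = misjudgment_rate(sample)
print(f"correct code judged incorrect: {rate:.1%}")
```

Computing the same rate separately for each prompting strategy would allow the kind of comparison the findings refer to, under the assumption that every strategy is run on the same set of implementations.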

Taxonomy

Topics: Natural Language Processing Techniques · Digital Rights Management and Security · Web Application Security Vulnerabilities