Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis
Amal Akli, Mike Papadakis, Maxime Cordy, Yves Le Traon

TL;DR
This paper introduces SpecValidator, a lightweight model that effectively detects defects in task descriptions for code generation, improving reliability across various benchmarks.
Contribution
Develops SpecValidator, a classifier built on a small, parameter-efficiently fine-tuned model, that automatically detects defective task descriptions and outperforms larger models in accuracy and generalization.
Findings
SpecValidator achieves F1=0.804, MCC=0.745 in defect detection.
It outperforms GPT-5-mini (F1 = 0.469, MCC = 0.281) and Claude Sonnet 4 (F1 = 0.518, MCC = 0.359).
Benchmarks whose task descriptions carry richer context are more robust to defects.
Abstract
Large language models are widely used for code generation, yet they rely on an implicit assumption that the task descriptions are sufficiently detailed and well-formed. In practice, however, users may provide defective descriptions, which can strongly affect code correctness. To address this issue, we develop SpecValidator, a lightweight classifier based on a small, parameter-efficiently fine-tuned model, to automatically detect task description defects. We evaluate SpecValidator on three types of defects (Lexical Vagueness, Under-Specification, and Syntax-Formatting) across three benchmarks with task descriptions of varying structure and complexity. Our results show that SpecValidator achieves F1 = 0.804 and MCC = 0.745 on defect detection, significantly outperforming GPT-5-mini (F1 = 0.469 and MCC = 0.281) and Claude Sonnet 4 (F1 = 0.518 and MCC = 0.359). Perhaps more…
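For readers unfamiliar with the two reported metrics, the sketch below shows how F1 and the Matthews correlation coefficient (MCC) are computed from a binary confusion matrix (defective vs. clean description). It is only an illustration: the excerpt does not say whether the paper averages scores across the three defect types or how, the function names are hypothetical, and the counts in the usage lines are made up rather than taken from the paper.

```python
import math


def confusion_counts(y_true, y_pred, positive="defective"):
    """Count TP, FP, FN, TN, treating `positive` as the defect class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_and_mcc(tp, fp, fn, tn):
    """Return (F1, MCC) for a binary confusion matrix.

    Both scores fall back to 0.0 when their denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc


# Toy labels for illustration only (not data from the paper).
y_true = ["defective", "defective", "clean", "clean", "defective", "clean"]
y_pred = ["defective", "clean", "clean", "defective", "defective", "clean"]
f1, mcc = f1_and_mcc(*confusion_counts(y_true, y_pred))
print(f"F1 = {f1:.3f}, MCC = {mcc:.3f}")
```

Unlike F1, MCC uses all four cells of the confusion matrix, including true negatives, which is why it can be much lower than F1 when a classifier over-predicts the defect class.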
