Refining Critical Thinking in LLM Code Generation: A Faulty Premise-based Evaluation Framework
Jialin Li, Jinzhe Li, Gengxu Li, Yi Chang, Yuan Wu

TL;DR
This paper introduces FPBench, a novel evaluation framework for assessing how large language models handle faulty premises in code generation, revealing their limited self-scrutiny and reasoning abilities under such conditions.
Contribution
It presents the first systematic framework for evaluating LLM code generation under faulty premises, constructing three categories of faulty premises and combining them with multi-dimensional metrics to analyze model deficiencies.
Findings
Most models reason poorly and generate suboptimal code under faulty premises, relying heavily on explicit prompts to detect errors.
Increasing prompt length does not improve code quality under faults.
Distinct faulty premise types activate different model defect patterns.
Abstract
With the advancement of code generation capabilities in large language models (LLMs), their reliance on input premises has intensified. When users provide inputs containing faulty premises, the probability of code generation hallucinations rises significantly, exposing deficiencies in their self-scrutiny capabilities. This paper proposes Faulty Premises Bench (FPBench), the first code generation evaluation framework targeting faulty premises. By systematically constructing three categories of faulty premises and integrating multi-dimensional evaluation metrics, it conducts in-depth assessments of 15 representative LLMs. The key findings are as follows: (1) Most models exhibit poor reasoning abilities and suboptimal code generation performance under faulty premises, heavily relying on explicit prompts for error detection, with limited self-scrutiny capabilities; (2) Faulty premises…
