Don't Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models

Jinzhe Li; Gengxu Li; Yi Chang; Yuan Wu

arXiv:2505.23715 · cs.CL · November 25, 2025

TL;DR

This paper evaluates how well large language models identify and critique flawed premises. It introduces a benchmark and finds that models depend on explicit prompts to detect errors and that their performance varies with error type and difficulty.

Contribution

It introduces the Premise Critique Bench (PCBench), which covers four error types across three difficulty levels, and uses it to systematically assess the premise critique ability of 15 representative LLMs.

Findings

1. Most models rely on explicit prompts for error detection.
2. Premise critique ability varies with error type and difficulty.
3. Flawed premises cause overthinking, lengthening responses.

Abstract

Large language models (LLMs) have witnessed rapid advancements, demonstrating remarkable capabilities. However, a notable vulnerability persists: LLMs often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs. This emphasizes the significance of possessing the Premise Critique Ability for LLMs, defined as the capacity to proactively identify and articulate errors in input premises. Most existing studies assess LLMs' reasoning ability in ideal settings, largely ignoring their vulnerabilities when faced with flawed premises. Thus, we introduce the Premise Critique Bench (PCBench), designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics. We conducted systematic evaluations of 15 representative LLMs. Our findings reveal: (1) Most models rely heavily on…

Code & Models

Repositories

mlgroupjlu/premise_critique
Official

Datasets

ALIENS232/PCBench
dataset · 35 downloads

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Text Readability and Simplification