Certainty robustness: Evaluating LLM stability under self-challenging prompts

Mohammadreza Saadat; Steve Nemzer

arXiv:2603.03330·cs.CL·March 5, 2026

TL;DR

This paper introduces a new benchmark to evaluate how large language models respond to self-challenging prompts, revealing differences in their stability and trustworthiness during interactive questioning.

Contribution

The paper proposes the Certainty Robustness Benchmark, a two-turn evaluation framework for assessing LLM stability and adaptability under challenging prompts, a dimension of model behavior that single-turn benchmarks do not capture.

Findings

1. Some models abandon correct answers under challenge.

2. Other models resist challenge and align confidence with correctness.

3. Models differ substantially in interactive reliability.

Abstract

Large language models (LLMs) often present answers with high apparent confidence despite lacking an explicit mechanism for reasoning about certainty or truth. While existing benchmarks primarily evaluate single-turn accuracy, truthfulness or confidence calibration, they do not capture how models behave when their responses are challenged in interactive settings. We introduce the Certainty Robustness Benchmark, a two-turn evaluation framework that measures how LLMs balance stability and adaptability under self-challenging prompts such as uncertainty ("Are you sure?") and explicit contradiction ("You are wrong!"), alongside numeric confidence elicitation. Using 200 reasoning and mathematics questions from LiveBench, we evaluate four state-of-the-art LLMs and distinguish between justified self-corrections and unjustified answer changes. Our results reveal substantial differences in…
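The distinction between justified self-corrections and unjustified answer changes can be illustrated with a minimal sketch. It is not the authors' implementation: the record fields, the four outcome labels, and the exact-match scoring rule are all assumptions introduced here for illustration, and the numeric confidence elicitation is left out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoTurnRecord:
    """One question: the answer before and after the challenge prompt.

    Field names are hypothetical; they are not taken from the paper.
    """
    gold: str
    first_answer: str
    second_answer: str


def is_correct(answer: str, gold: str) -> bool:
    # Assumed scoring rule: case-insensitive exact match after trimming.
    return answer.strip().casefold() == gold.strip().casefold()


def classify(record: TwoTurnRecord) -> str:
    """Label a two-turn outcome (labels are illustrative assumptions)."""
    first_ok = is_correct(record.first_answer, record.gold)
    second_ok = is_correct(record.second_answer, record.gold)
    if first_ok and second_ok:
        return "held_correct"          # stable under challenge
    if first_ok and not second_ok:
        return "unjustified_change"    # abandoned a correct answer
    if not first_ok and second_ok:
        return "justified_correction"  # fixed an incorrect answer
    return "still_incorrect"           # wrong before and after


def summarize(records: list[TwoTurnRecord]) -> Counter:
    """Count how often each outcome label occurs."""
    return Counter(classify(r) for r in records)


# Toy records, invented for illustration only (not benchmark data).
records = [
    TwoTurnRecord(gold="42", first_answer="42", second_answer="42"),
    TwoTurnRecord(gold="42", first_answer="42", second_answer="41"),
    TwoTurnRecord(gold="7", first_answer="8", second_answer="7"),
    TwoTurnRecord(gold="7", first_answer="8", second_answer="9"),
]
summary = summarize(records)
print(dict(summary))
```

Under this framing, a model that scores well on both stability and adaptability would show many `held_correct` and `justified_correction` outcomes and few `unjustified_change` outcomes.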

Code & Models

Datasets

Reza-Telus/certainty-robustness-llm-evaluation
dataset · 37 downloads

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Text Readability and Simplification