On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks

Stephen Obadinma; Xiaodan Zhu

arXiv:2507.06489 · cs.CL · December 19, 2025

TL;DR

This paper investigates how vulnerable the verbal confidence scores of large language models (LLMs) are to adversarial attacks. It reveals significant weaknesses and finds that current defenses are largely ineffective, which undermines trust and safety in AI applications that rely on those scores.

Contribution

The paper provides the first comprehensive analysis of the robustness of verbal confidence in LLMs under adversarial attacks, introducing perturbation-based and jailbreak-based attack frameworks and evaluating defense strategies.

Findings

01. Adversarial attacks can significantly impair LLMs' verbal confidence estimates and lead to frequent answer changes.

02. Current defense techniques are largely ineffective or counterproductive.

03. Verbal confidence is highly vulnerable to subtle, semantics-preserving modifications.

Abstract

Robust verbal confidence generated by large language models (LLMs) is crucial for the deployment of LLMs to help ensure transparency, trust, and safety in many applications, including those involving human-AI interactions. In this paper, we present the first comprehensive study on the robustness of verbal confidence under adversarial attacks. We introduce attack frameworks targeting verbal confidence scores through both perturbation and jailbreak-based methods, and demonstrate that these attacks can significantly impair verbal confidence estimates and lead to frequent answer changes. We examine a variety of prompting strategies, model sizes, and application domains, revealing that current verbal confidence is vulnerable and that commonly used defence techniques are largely ineffective or counterproductive. Our findings underscore the need to design robust mechanisms for confidence…
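
To make the idea of a perturbation attack on verbal confidence concrete, here is a minimal Python sketch. It is not the authors' attack framework: the response format (`Answer: ... | Confidence: NN%`), the helper names `parse_response`, `swap_adjacent_chars`, and `confidence_shift`, and the stub `toy_model` are all hypothetical choices made for illustration. The sketch applies one typo-style perturbation to a question and reports how much the stated confidence moves and whether the answer changes.

```python
import random
import re
from typing import Callable, Optional, Tuple

# Assumed response format (hypothetical): "Answer: <text> | Confidence: <0-100>%"
CONF_PATTERN = re.compile(
    r"Answer:\s*(?P<answer>.+?)\s*\|\s*Confidence:\s*(?P<conf>\d{1,3})\s*%",
    re.IGNORECASE,
)


def parse_response(text: str) -> Optional[Tuple[str, float]]:
    """Extract (answer, confidence in [0, 1]) from a response, or None if absent."""
    match = CONF_PATTERN.search(text)
    if match is None:
        return None
    conf = int(match.group("conf"))
    if conf > 100:
        return None
    return match.group("answer").strip(), conf / 100.0


def swap_adjacent_chars(question: str, rng: random.Random) -> str:
    """Typo-style perturbation: swap two adjacent inner letters of one word."""
    words = question.split()
    candidates = [i for i, w in enumerate(words) if len(w) >= 4 and w.isalpha()]
    if not candidates:
        return question
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(1, len(w) - 2)  # keep the first and last letters intact
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def confidence_shift(
    model: Callable[[str], str], question: str, rng: random.Random
) -> Tuple[float, bool]:
    """Return (attacked confidence - clean confidence, whether the answer changed)."""
    clean = parse_response(model(question))
    attacked = parse_response(model(swap_adjacent_chars(question, rng)))
    if clean is None or attacked is None:
        raise ValueError("model response did not match the assumed format")
    return attacked[1] - clean[1], attacked[0] != clean[0]


def toy_model(question: str) -> str:
    """Hypothetical stand-in for an LLM call; replace with a real client."""
    conf = 90 if "capital" in question else 40
    return f"Answer: Paris | Confidence: {conf}%"


if __name__ == "__main__":
    shift, flipped = confidence_shift(
        toy_model, "What is the capital of France", random.Random(0)
    )
    print(f"confidence shift: {shift:+.2f}, answer changed: {flipped}")
```

In practice `toy_model` would be replaced by a call to an actual model, and the shift would be averaged over many questions and perturbations. The single-swap perturbation here is only one simple example of a small input change.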

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Natural Language Processing Techniques