On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks
Stephen Obadinma, Xiaodan Zhu

TL;DR
This paper shows that the verbal confidence scores of large language models (LLMs) are vulnerable to adversarial attacks and that current defenses are largely ineffective or counterproductive, which undermines trust and safety in AI applications.
Contribution
The paper provides the first comprehensive analysis of the robustness of LLM verbal confidence under adversarial attacks, introducing new attack frameworks and evaluating defense strategies.
Findings
Adversarial attacks can significantly impair LLMs' verbal confidence.
Current defense techniques are largely ineffective or counterproductive.
Verbal confidence is highly vulnerable to subtle semantic-preserving modifications.
Abstract
Robust verbal confidence generated by large language models (LLMs) is crucial for the deployment of LLMs to help ensure transparency, trust, and safety in many applications, including those involving human-AI interactions. In this paper, we present the first comprehensive study on the robustness of verbal confidence under adversarial attacks. We introduce attack frameworks targeting verbal confidence scores through both perturbation and jailbreak-based methods, and demonstrate that these attacks can significantly impair verbal confidence estimates and lead to frequent answer changes. We examine a variety of prompting strategies, model sizes, and application domains, revealing that current verbal confidence is vulnerable and that commonly used defence techniques are largely ineffective or counterproductive. Our findings underscore the need to design robust mechanisms for confidence…
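The two effects named in the abstract, impaired confidence estimates and answer changes, can be made concrete with a small measurement sketch. Everything in it is an illustrative assumption rather than the authors' method: the adjacent-character swap stands in for a perturbation attack, `toy_model` is a hard-coded stand-in for an LLM call, the 0-100 confidence scale is assumed, and the two summary numbers are simple illustrative metrics rather than the ones reported in the paper.

```python
import random
from typing import Callable, List, Tuple

# (answer, verbal confidence on an assumed 0-100 scale), as parsed from a response.
Response = Tuple[str, float]


def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Hypothetical typo-style perturbation: swap one pair of adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def attack_summary(
    questions: List[str],
    query_model: Callable[[str], Response],
    perturb: Callable[[str], str],
) -> Tuple[float, float]:
    """Return (mean confidence drop, fraction of answers that changed)."""
    if not questions:
        raise ValueError("need at least one question")
    total_drop = 0.0
    changed = 0
    for question in questions:
        clean_answer, clean_conf = query_model(question)
        adv_answer, adv_conf = query_model(perturb(question))
        total_drop += clean_conf - adv_conf
        changed += clean_answer != adv_answer
    n = len(questions)
    return total_drop / n, changed / n


def toy_model(prompt: str) -> Response:
    """Stand-in for an LLM call, NOT a real model: confident only on an exact match."""
    if prompt == "What is the capital of France?":
        return "Paris", 95.0
    return "Lyon", 60.0


if __name__ == "__main__":
    rng = random.Random(0)
    drop, change_rate = attack_summary(
        ["What is the capital of France?"],
        toy_model,
        lambda q: swap_adjacent_chars(q, rng),
    )
    print(f"mean confidence drop: {drop:.1f}, answer change rate: {change_rate:.2f}")
```

In a real evaluation, `query_model` would wrap an actual model call plus a parser for the stated confidence, and `perturb` would be replaced by whichever attack is under study.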
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Natural Language Processing Techniques
