Confidence Elicitation: A New Attack Vector for Large Language Models
Brian Formento, Chuan Sheng Foo, See-Kiong Ng

TL;DR
This paper introduces a novel attack vector for large language models by eliciting and minimizing model confidence, enabling more effective black-box adversarial attacks without access to internal probabilities.
Contribution
The work demonstrates that confidence elicitation can be used as a new attack paradigm against LLMs, showing state-of-the-art results in black-box adversarial settings.
Findings
Elicited confidence is well-calibrated and not hallucinated.
Minimizing elicited confidence increases misclassification likelihood.
The attack achieves higher success rates than existing methods.
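To make the calibration finding concrete, here is a minimal sketch of expected calibration error (ECE), a common way to measure how well stated confidence matches accuracy. This summary does not say which calibration metric the paper uses, so the choice of ECE, the function name, and the bin count are assumptions made for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width bins.

    Illustrative only: the metric and the bin count are assumptions,
    not taken from the paper.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

A model that reports 0.9 confidence and is right nine times out of ten scores close to zero. A model that reports 1.0 confidence and is right half the time scores 0.5.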
Abstract
A fundamental issue in deep learning has been adversarial robustness. As these systems have scaled, such issues have persisted. Currently, large language models (LLMs) with billions of parameters suffer from adversarial attacks just like their earlier, smaller counterparts. However, the threat models have changed. Previously, having gray-box access, where input embeddings or output logits/probabilities were visible to the user, might have been reasonable. However, with the introduction of closed-source models, no information about the model is available apart from the generated output. This means that current black-box attacks can only utilize the final prediction to detect if an attack is successful. In this work, we investigate and demonstrate the potential of attack guidance, akin to using output probabilities, while having only black-box access in a classification setting. This is…
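The idea of using elicited confidence as attack guidance can be sketched as a greedy word-substitution search. This is not the authors' algorithm. It is a minimal illustration under stated assumptions: `query` is a hypothetical callable standing in for prompting the model and parsing its answer and stated confidence, `synonyms` is a hypothetical substitution table, and `toy_query` is a keyword stub used only so the example runs without an LLM.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: returns (predicted_label, stated confidence in [0, 1]).
Query = Callable[[str], Tuple[str, float]]


def confidence_guided_attack(
    text: str,
    true_label: str,
    query: Query,
    synonyms: Dict[str, List[str]],
    max_queries: int = 50,
) -> Tuple[str, bool, int]:
    """Greedy search: keep substitutions that lower the stated confidence."""
    words = text.split()
    label, best_conf = query(text)
    queries = 1
    if label != true_label:
        return text, True, queries  # already misclassified
    for i, word in enumerate(list(words)):
        for candidate in synonyms.get(word.lower(), []):
            if queries >= max_queries:
                return " ".join(words), False, queries
            trial = words[:i] + [candidate] + words[i + 1:]
            label, conf = query(" ".join(trial))
            queries += 1
            if label != true_label:
                return " ".join(trial), True, queries  # prediction flipped
            if conf < best_conf:
                # Confidence dropped: keep this substitution as the new base.
                words, best_conf = trial, conf
    return " ".join(words), False, queries


# Toy stand-in for an LLM (assumption, for demonstration only).
WEIGHTS = {"great": 0.4, "loved": 0.4, "fine": 0.1}
SYNONYMS = {"great": ["fine"], "loved": ["tolerated"]}


def toy_query(text: str) -> Tuple[str, float]:
    score = sum(WEIGHTS.get(w, 0.0) for w in text.lower().split())
    label = "positive" if score >= 0.3 else "negative"
    return label, 0.5 + min(0.5, abs(score - 0.3))


adversarial, success, used = confidence_guided_attack(
    "great movie loved it", "positive", toy_query, SYNONYMS
)
```

With the toy stub, the first substitution lowers the stated confidence without changing the label, and the second flips the prediction. A hard-label attack would get no signal from the first step, which is the gap the abstract describes.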
Peer Reviews
Decision: ICLR 2025 Poster
* This study introduces the first approach for constructing adversarial examples of large language models (LLMs) by leveraging their confidence elicitation capabilities.
* The proposed confidence elicitation attack is easy to conduct and requires fewer queries than existing black-box adversarial attacks. This straightforward adversarial attack provides an easy-to-implement method for testing the potential vulnerabilities of LLMs while maintaining a higher degree of semantic similarity in the p
The study provides insights into the calibration of LLMs and the effectiveness of confidence elicitation in guiding adversarial perturbations, highlighting potential implications for the robustness of LLMs. Thus, it contributes a new perspective to the field of adversarial machine learning. However, I still see some issues that, if addressed, could improve the quality of the paper:
* Limited to classification tasks. Unlike existing jailbreaks, which target the generation function of LLMs, this work focuse
1. This is the first paper to propose applying confidence elicitation to address the unavailability of soft labels in adversarial attacks on large language models.
2. Experimental results demonstrate the effectiveness of CEAttack compared with the baseline attack methods Self-Fool Word Sub and SSPAttack.
1. The paper lacks a baseline comparison with prompt optimization methods such as Tree of Attacks [1].
2. The scope of the study is limited to classification tasks, which are only a small part of what current large language models can do and have only a minor impact for now. It is still unclear whether the method remains effective on generative tasks such as jailbreaking.
3. The paper lacks a discussion of potential defenses against the attack.
[1] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelso
+ This work is significant to the field, as the closed-source nature of modern LLMs is often referenced when arguing against popular optimisation methods (that require grey-box access). Additionally, this work shows that there may be some consequences to the confidence elicitation that many developers have been pushing for, though it likely isn’t enough to pause this work.
+ The authors do a good job in presenting existing work, its findings and problems, as well as presenting their own method
+ Experiments could possibly be improved by evaluating against actual closed-source models, especially since they should be more resilient to non-semantic input perturbations and are probably better at confidence elicitation. In fact, I assume that using anything smaller than the instruction-tuned 8B parameter models from the paper will result in the prompt template no longer working?
+ Since open-source models were used, the authors could have shown what difference still remains between yo
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
