Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models
Qingcheng Zeng, Mingyu Jin, Qinkai Yu, Zhenting Wang, Wenyue Hua,, Zihao Zhou, Guangyan Sun, Yanda Meng, Shiqing Ma, Qifan Wang, Felix, Juefei-Xu, Kaize Ding, Fan Yang, Ruixiang Tang, Yongfeng Zhang

TL;DR
This paper reveals that the uncertainty estimates of large language models can be manipulated through backdoor attacks, undermining their reliability without changing the model's primary output, posing a significant security threat.
Contribution
It introduces a novel backdoor attack method that manipulates LLMs' uncertainty estimates while preserving their top predictions, exposing a new vulnerability in LLM reliability.
Findings
Achieved 100% attack success rate across multiple models and strategies.
Demonstrated manipulation of uncertainty estimates without affecting top-1 predictions.
Showed the attack's effectiveness across different prompts and domains.
Abstract
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial. One commonly used method to assess the reliability of LLMs' responses is uncertainty estimation, which gauges the likelihood of their answers being correct. While many studies focus on improving the accuracy of uncertainty estimations for LLMs, our research investigates the fragility of uncertainty estimation and explores potential attacks. We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output. Specifically, the proposed backdoor attack method can alter an LLM's output probability distribution, causing the probability distribution to converge towards an attacker-predefined distribution while ensuring that the top-1 prediction…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques
MethodsFocus
