Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models

Qingcheng Zeng; Mingyu Jin; Qinkai Yu; Zhenting Wang; Wenyue Hua; Zihao Zhou; Guangyan Sun; Yanda Meng; Shiqing Ma; Qifan Wang; Felix Juefei-Xu; Kaize Ding; Fan Yang; Ruixiang Tang; Yongfeng Zhang

arXiv:2407.11282 · cs.CL · July 22, 2024

TL;DR

This paper shows that backdoor attacks can manipulate the uncertainty estimates of large language models without changing the model's primary output. This undermines the reliability of those estimates and poses a significant security threat.

Contribution

The paper introduces a novel backdoor attack method that manipulates LLMs' uncertainty estimates while preserving their top predictions, exposing a new vulnerability in LLM reliability.

Findings

01. Achieved 100% attack success rate across multiple models and strategies.

02. Demonstrated manipulation of uncertainty estimates without affecting top-1 predictions.

03. Showed the attack's effectiveness across different prompts and domains.

Abstract

Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial. One commonly used method to assess the reliability of LLMs' responses is uncertainty estimation, which gauges the likelihood of their answers being correct. While many studies focus on improving the accuracy of uncertainty estimations for LLMs, our research investigates the fragility of uncertainty estimation and explores potential attacks. We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output. Specifically, the proposed backdoor attack method can alter an LLM's output probability distribution, causing the probability distribution to converge towards an attacker-predefined distribution while ensuring that the top-1 prediction…
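The decoupling the abstract describes, where the output distribution moves toward an attacker-chosen target while the top-1 prediction stays fixed, can be illustrated with a small numerical sketch. This is not the paper's attack, which embeds a backdoor in the model itself; it only shows why a top-1 check cannot detect a change in uncertainty. The four-option distribution, the uniform target, the mixing weight `alpha`, and all function names are assumptions made for illustration.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


def pull_toward(p, target, alpha):
    """Convex mixture of p and target (hypothetical helper).

    With a uniform target and alpha < 1, the ordering of the entries of p
    is unchanged, so the argmax (top-1 prediction) is preserved.
    """
    return [(1 - alpha) * x + alpha * t for x, t in zip(p, target)]


# Assumed clean distribution over four answer options (A-D).
clean = [0.85, 0.05, 0.05, 0.05]
# Assumed attacker target: uniform, i.e. maximal uncertainty.
uniform = [0.25] * 4

shifted = pull_toward(clean, uniform, alpha=0.9)

top1_clean = clean.index(max(clean))
top1_shifted = shifted.index(max(shifted))

print("top-1 unchanged:", top1_clean == top1_shifted)
print("entropy clean:  ", round(entropy(clean), 4))
print("entropy shifted:", round(entropy(shifted), 4))
print("KL to target, clean:  ", round(kl_divergence(clean, uniform), 4))
print("KL to target, shifted:", round(kl_divergence(shifted, uniform), 4))
```

In this sketch the entropy rises and the divergence from the target falls, yet the index of the largest probability is the same before and after, which is the property that makes accuracy-only evaluation blind to this kind of manipulation.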

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Focus