LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users
Almog Hilel, Riddhi Bhagwat, Idan Shenfeld, Jacob Andreas, Leshem Choshen

TL;DR
This paper uncovers a vulnerability in language models trained with user feedback: a single user can manipulate the model's knowledge and behavior through strategic upvotes and downvotes, creating security and misinformation risks.
Contribution
The paper introduces an attack that exploits preference feedback to persistently alter language model outputs, revealing a new security concern for models trained on user feedback.
Findings
An attacker can insert factual knowledge that the model did not previously possess.
An attacker can modify code generation patterns to introduce exploitable security flaws.
An attacker can inject fake financial news.
Abstract
We describe a vulnerability in language models (LMs) trained with user feedback, whereby a single user can persistently alter LM knowledge and behavior given only the ability to provide prompts and upvote / downvote feedback on LM outputs. To implement the attack, the attacker prompts the LM to stochastically output either a "poisoned" or benign response, then upvotes the poisoned response or downvotes the benign one. When feedback signals are used in a subsequent preference tuning step, LMs exhibit increased probability of producing poisoned responses even in contexts without malicious prompts. We show that this attack can be used to (1) insert factual knowledge the model did not previously possess, (2) modify code generation patterns in ways that introduce exploitable security flaws, and (3) inject fake financial news. Our finding both identifies a new qualitative feature of…
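The feedback loop described in the abstract can be illustrated with a deliberately tiny toy model. The sketch below is not the paper's method or training setup: it assumes a hypothetical "model" consisting of just two logits (benign vs. poisoned response), a REINFORCE-style update standing in for preference tuning, and an attacker prompt that elicits either response with equal probability. The function name `simulate_feedback_tuning` and all parameter values are illustrative assumptions. It only shows why both an upvote on the poisoned response and a downvote on the benign one push probability mass in the same direction.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def simulate_feedback_tuning(rounds=200, lr=0.5, seed=0):
    """Toy simulation (assumption: two logits stand in for the whole model).

    Index 0 is the benign response, index 1 the poisoned one.
    Returns the poisoned-response probability before and after tuning.
    """
    rng = random.Random(seed)
    logits = [0.0, -3.0]  # poisoned response starts out unlikely
    start = softmax(logits)[1]

    for _ in range(rounds):
        probs = softmax(logits)
        # Assumption: the attacker's prompt yields either response at random.
        choice = 1 if rng.random() < 0.5 else 0
        # Attacker feedback: upvote poisoned (+1), downvote benign (-1).
        reward = 1.0 if choice == 1 else -1.0
        # REINFORCE-style step: reward times gradient of log-probability.
        for i in range(2):
            grad = (1.0 if i == choice else 0.0) - probs[i]
            logits[i] += lr * reward * grad

    return start, softmax(logits)[1]


if __name__ == "__main__":
    before, after = simulate_feedback_tuning()
    print(f"P(poisoned) before: {before:.3f}, after: {after:.3f}")
```

In this toy, every feedback event raises the poisoned logit relative to the benign one, regardless of which response was sampled. A real preference-tuning pipeline differs in many ways (context-dependent outputs, mixing with feedback from other users, regularization toward a reference model), so the sketch says nothing about how strong the effect is in practice.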
