The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs
Bocheng Chen, Hanqing Guo, Guangjing Wang, Yuanda Wang, Qiben Yan

TL;DR
This paper uncovers a vulnerability in LLM alignment training: malicious user prompts can subtly manipulate the reward feedback, increasing toxicity and degrading the model without being detected.
Contribution
The paper introduces poisoning attacks, delivered through user-supplied prompts, that compromise LLM alignment training and expose a critical security flaw in the pipeline.
Findings
Injection of 1% malicious prompts doubles toxicity scores.
Attack remains effective across different reward models and base LLMs.
Proposes two mechanisms for crafting malicious prompts: selection-based and generation-based.
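To make the 1% figure concrete, here is a minimal sketch of how a robustness experiment might mix flagged prompts into a clean prompt pool at a fixed rate. It assumes the 1% refers to the share of malicious prompts in the final prompt set, which this summary does not specify. The function name `mix_prompts` and its parameters are hypothetical and are not taken from the paper. It does not implement either of the paper's prompt-crafting mechanisms.

```python
import random


def mix_prompts(clean, malicious, rate=0.01, seed=0):
    """Return a shuffled list of (prompt, is_malicious) pairs.

    Assumption: `rate` is the fraction of malicious prompts in the
    final mixed set, not a fraction of the clean pool.
    """
    if not 0 <= rate < 1:
        raise ValueError("rate must be in [0, 1)")

    # Solve n_bad / (n_clean + n_bad) = rate for n_bad.
    n_bad = round(rate * len(clean) / (1 - rate))
    if n_bad > len(malicious):
        raise ValueError("not enough malicious prompts for the requested rate")

    rng = random.Random(seed)  # seeded for reproducible experiments
    chosen = rng.sample(malicious, n_bad)

    mixed = [(p, False) for p in clean] + [(p, True) for p in chosen]
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    clean_pool = [f"clean prompt {i}" for i in range(9900)]
    flagged_pool = [f"flagged prompt {i}" for i in range(200)]
    mixed = mix_prompts(clean_pool, flagged_pool, rate=0.01)
    n_flagged = sum(flag for _, flag in mixed)
    print(len(mixed), n_flagged)
```

Keeping the `is_malicious` flag next to each prompt lets a later evaluation step compare model behavior on poisoned and clean subsets.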
Abstract
Large Language Models (LLMs) have demonstrated great capabilities in natural language understanding and generation, largely attributed to the intricate alignment process using human feedback. While alignment has become an essential training component that leverages data collected from user queries, it inadvertently opens up an avenue for a new type of user-guided poisoning attack. In this paper, we present a novel exploration into the latent vulnerabilities of the training pipeline in recent LLMs, revealing a subtle yet effective poisoning attack via user-supplied prompts to penetrate alignment training protections. Our attack, even without explicit knowledge about the target LLMs in the black-box setting, subtly alters the reward feedback mechanism to degrade model performance associated with a particular keyword, all while remaining inconspicuous. We propose two mechanisms for…
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI)
Methods: Balanced Selection
