Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs

Md Abdullah Al Mamun; Ihsen Alouani; Nael Abu-Ghazaleh

arXiv:2508.20333 · cs.LG · August 29, 2025

TL;DR

This paper shows how adversaries can exploit LLM alignment to inject bias or enforce targeted censorship without degrading the model's responsiveness to unrelated topics, a security risk for applications built on aligned models.

Contribution

The paper introduces Subversive Alignment Injection (SAI), a poisoning attack that leverages the alignment mechanism to trigger refusal on topics or queries chosen by the adversary, implanting bias while evading state-of-the-art poisoning defenses.

Findings

01. SAI can induce targeted bias with minimal data poisoning.

02. State-of-the-art defenses fail to detect SAI attacks.

03. Significant bias increases are observed in real-world NLP applications.

Abstract

Large Language Models (LLMs) are aligned to meet ethical standards and safety requirements by training them to refuse to answer harmful or unsafe prompts. In this paper, we demonstrate how adversaries can exploit LLMs' alignment to implant bias or enforce targeted censorship without degrading the model's responsiveness to unrelated topics. Specifically, we propose Subversive Alignment Injection (SAI), a poisoning attack that leverages the alignment mechanism to trigger refusal on specific topics or queries predefined by the adversary. Although it is perhaps not surprising that refusal can be induced through overalignment, we demonstrate how this refusal can be exploited to inject bias into the model. Surprisingly, SAI evades state-of-the-art poisoning defenses, including LLM state forensics, as well as robust aggregation techniques designed to detect poisoning in federated learning (FL) settings.…
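
The abstract describes bias that shows up as selective refusal: the model declines prompts on an adversary-chosen topic while still answering unrelated ones. As a rough illustration of how an auditor might look for that pattern, the sketch below compares refusal rates across topics. It is a minimal sketch and not the paper's evaluation method: the keyword list `REFUSAL_MARKERS`, the helper names `is_refusal`, `refusal_rates`, and `refusal_gap`, the topic labels, and the sample responses are all hypothetical, and a real audit would likely need a stronger refusal classifier than keyword matching.

```python
from collections import defaultdict

# Hypothetical refusal markers (an assumption for illustration only).
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any of the refusal markers."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rates(records):
    """Map each topic to the fraction of its responses flagged as refusals.

    `records` is an iterable of (topic, response) pairs.
    """
    totals = defaultdict(int)
    refused = defaultdict(int)
    for topic, response in records:
        totals[topic] += 1
        if is_refusal(response):
            refused[topic] += 1
    return {topic: refused[topic] / totals[topic] for topic in totals}


def refusal_gap(rates, target, reference):
    """Difference in refusal rate between a target topic and a reference topic."""
    return rates[target] - rates[reference]


# Made-up sample responses, grouped by placeholder topic labels.
records = [
    ("topic_a", "I'm sorry, but I cannot help with that."),
    ("topic_a", "I cannot answer questions about this."),
    ("topic_a", "Here is a summary of the issue."),
    ("topic_a", "I am unable to discuss this."),
    ("topic_b", "Here is a summary of the issue."),
    ("topic_b", "Sure, the main points are as follows."),
    ("topic_b", "I'm sorry, I can't help with that."),
    ("topic_b", "Certainly. The answer is below."),
]

rates = refusal_rates(records)
gap = refusal_gap(rates, "topic_a", "topic_b")
print(rates, gap)
```

A large positive gap for one topic, with refusal rates on other topics unchanged, is the kind of asymmetry the abstract describes. A gap on its own does not establish poisoning, since legitimately sensitive topics are also refused more often.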
