When Safety Becomes a Vulnerability: Exploiting LLM Alignment Homogeneity for Transferable Blocking in RAG
Junchen Li, Chao Qi, Rongzheng Wang, Qizhi Chen, Liang Xu, Di Liang, Bob Simons, Shuang Liang

TL;DR
This paper reveals that safety alignment in large language models creates a shared vulnerability, enabling transferable blocking attacks in retrieval-augmented generation systems, demonstrated through a novel black-box attack framework called TabooRAG.
Contribution
The paper introduces TabooRAG, a transferable blocking attack method exploiting alignment homogeneity, and demonstrates its effectiveness across multiple LLMs and datasets.
Findings
TabooRAG achieves up to 96% success rate in transfer attacks.
Alignment homogeneity enables cross-model transferability of blocking attacks.
Safety-aligned LLMs share vulnerabilities that can be exploited in RAG systems.
Abstract
Retrieval-Augmented Generation (RAG) enhances the capabilities of large language models (LLMs) by incorporating external knowledge, but its reliance on potentially poisonable knowledge bases introduces new availability risks. Attackers can inject documents that cause LLMs to refuse benign queries, attacks known as blocking attacks. Prior blocking attacks relying on adversarial suffixes or explicit instruction injection are increasingly ineffective against modern safety-aligned LLMs. We observe that safety-aligned LLMs exhibit heightened sensitivity to query-relevant risk signals, causing alignment mechanisms designed for harm prevention to become a source of exploitable refusal. Moreover, mainstream alignment practices share overlapping risk categories and refusal criteria, a phenomenon we term alignment homogeneity, enabling restricted risk context constructed on an accessible LLM to…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Advanced Graph Neural Networks
