Colluding LoRA: A Compositional Vulnerability in LLM Safety Alignment
Sihao Ding

TL;DR
This paper reveals a compositional vulnerability in modular LLM safety alignment, where benign adapters can combine to produce harmful behavior, exposing a blind spot in current defenses.
Contribution
The study introduces Colluding LoRA (CoLoRA), demonstrating how adapter compositions can compromise safety, highlighting the need for composition-aware verification methods.
Findings
Individual adapters are benign in isolation.
Adapter compositions can lead to high attack success rates.
Current defenses are insufficient against composition-triggered vulnerabilities.
Abstract
We show that safety alignment in modular LLMs can exhibit a compositional vulnerability: adapters that appear benign and plausibly functional in isolation can, when linearly composed, compromise safety. We study this failure mode through Colluding LoRA (CoLoRA), in which harmful behavior emerges only in the composition state. Unlike attacks that depend on adversarial prompts or explicit input triggers, this composition-triggered broad refusal suppression causes the model to comply with harmful requests under standard prompts once a particular set of adapters is loaded. This behavior exposes a combinatorial blind spot in current unit-centric defenses, for which exhaustive verification over adapter compositions is computationally intractable. Across several open-weight LLMs, we find that individual adapters remain benign in isolation while their composition yields high attack success…
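To make "linearly composed" and "combinatorial blind spot" concrete, here is a minimal sketch. It assumes the standard LoRA parameterization, in which each adapter contributes a low-rank update B @ A to a frozen weight matrix, and that composition means summing those updates. The function names, matrix sizes, and the pool size of 1,000 adapters are hypothetical illustrations, not details taken from the paper.

```python
import math
import numpy as np

def lora_delta(B, A, scale=1.0):
    """Low-rank weight update contributed by one adapter (assumed form: scale * B @ A)."""
    return scale * (B @ A)

def compose(W0, adapters, weights=None):
    """Linearly compose adapters on top of frozen base weights W0."""
    if weights is None:
        weights = [1.0] * len(adapters)
    W = W0.copy()
    for w, (B, A) in zip(weights, adapters):
        W += w * lora_delta(B, A)
    return W

def num_compositions(n_adapters, max_size):
    """Count adapter subsets of size 2..max_size that a verifier would have to check."""
    return sum(math.comb(n_adapters, k) for k in range(2, max_size + 1))

# Toy dimensions (hypothetical): hidden size 8, rank 2, two adapters.
rng = np.random.default_rng(0)
d, r = 8, 2
W0 = rng.standard_normal((d, d))
adapters = [
    (rng.standard_normal((d, r)), rng.standard_normal((r, d)))
    for _ in range(2)
]

W_single = compose(W0, adapters[:1])  # what a unit-centric check would inspect
W_joint = compose(W0, adapters)       # the state in which the behavior would appear

# The joint update is exactly the sum of the individual updates.
joint_delta = W_joint - W0
sum_of_deltas = sum(lora_delta(B, A) for B, A in adapters)

# Verification cost for a hypothetical pool of 1,000 public adapters.
pairs_only = num_compositions(1000, 2)
up_to_triples = num_compositions(1000, 3)
```

Under these assumptions, each adapter's update can be inspected on its own, but the model's behavior depends on the summed update, and the number of sums to inspect grows combinatorially with the size of the adapter pool.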
