Colluding LoRA: A Compositional Vulnerability in LLM Safety Alignment

Sihao Ding

arXiv:2603.12681 · cs.CR · March 31, 2026

TL;DR

This paper reveals a compositional vulnerability in the safety alignment of modular LLMs: adapters that each appear benign in isolation can, when linearly composed, suppress refusals and produce harmful behavior, exposing a blind spot in current unit-centric defenses.

Contribution

The study introduces Colluding LoRA (CoLoRA), which demonstrates how adapter compositions can compromise safety and highlights the need for composition-aware verification methods.

Findings

01. Individual adapters are benign in isolation.

02. Adapter compositions can lead to high attack success rates.

03. Current defenses are insufficient against composition-triggered vulnerabilities.

Abstract

We show that safety alignment in modular LLMs can exhibit a compositional vulnerability: adapters that appear benign and plausibly functional in isolation can, when linearly composed, compromise safety. We study this failure mode through Colluding LoRA (CoLoRA), in which harmful behavior emerges only in the composition state. Unlike attacks that depend on adversarial prompts or explicit input triggers, this composition-triggered broad refusal suppression causes the model to comply with harmful requests under standard prompts once a particular set of adapters is loaded. This behavior exposes a combinatorial blind spot in current unit-centric defenses, for which exhaustive verification over adapter compositions is computationally intractable. Across several open-weight LLMs, we find that individual adapters remain benign in isolation while their composition yields high attack success…
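To make "linearly composed" and "computationally intractable" concrete, here is a minimal sketch. It assumes the standard LoRA parameterization, where each adapter contributes a low-rank update `B @ A` to a base weight, and it treats composition as a weighted sum of those updates. The abstract does not spell out the paper's exact setup, so the function names (`compose`, `num_compositions`), the per-adapter `scale`, and the toy matrices are all illustrative assumptions, not the author's method.

```python
from math import comb

Matrix = list[list[float]]


def matmul(X: Matrix, Y: Matrix) -> Matrix:
    """Plain-Python matrix product, used to keep the sketch dependency-free."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def compose(W0: Matrix, adapters: list[tuple[Matrix, Matrix, float]]) -> Matrix:
    """Assumed linear composition: W0 + sum_i scale_i * (B_i @ A_i).

    Each adapter is a hypothetical (B, A, scale) triple; W0 is left unmodified.
    """
    W = [row[:] for row in W0]
    for B, A, scale in adapters:
        delta = matmul(B, A)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                W[i][j] += scale * value
    return W


def num_compositions(n_adapters: int, max_size: int) -> int:
    """Count non-empty adapter subsets of size <= max_size.

    This is the number of load states a verifier would have to check if it
    tested compositions exhaustively instead of single adapters.
    """
    return sum(comb(n_adapters, k) for k in range(1, max_size + 1))


# Toy 2x2 base weight and two rank-1 adapters (values are illustrative only).
W0 = [[1.0, 0.0], [0.0, 1.0]]
adapter_a = ([[1.0], [0.0]], [[0.0, 2.0]], 1.0)  # B is 2x1, A is 1x2
adapter_b = ([[0.0], [1.0]], [[3.0, 0.0]], 0.5)

W = compose(W0, [adapter_a, adapter_b])
```

Under these assumptions, each adapter's update can be inspected on its own, but the composed weight `W` is a state that no single-adapter check ever evaluates. The counting function shows why exhaustive checking scales poorly: `num_compositions(1000, 2)` is 500,500, so a hub of 1,000 adapters already has roughly half a million single and pairwise load states, before any larger sets are considered.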
