Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models
Xiangtao Meng, Wenyu Chen, Chuanchao Zang, Xinyu Gao, Jianing Wang, Li Wang, Zheng Li, Shanqing Guo

TL;DR
This paper systematically studies how sequential defenses in large language models interact, revealing conflicts and proposing a layer freezing mitigation to preserve protections during incremental deployment.
Contribution
It is the first to analyze cross-defense interactions in sequential LLM deployment and introduces a conflict score and mitigation strategy to address defense conflicts.
Findings
38.9% of defense sequences show risk exacerbation
Conflicting defenses localize to critical layers with anti-aligned updates
Layer freezing mitigates conflicts without harming secondary defenses
Abstract
Large Language Models (LLMs) deployed in high-stakes applications must simultaneously manage multiple risks, yet existing defenses are almost exclusively evaluated in isolation under a one-shot deployment assumption. In practice, providers patch models incrementally throughout their lifecycle-responding to newly exposed vulnerabilities or targeted data-removal requests without retraining from scratch. This raises a fundamental but underexplored question: does a later defense preserve the protections established by an earlier one? We present the first systematic study of cross-defense interactions under sequential deployment. Evaluating 144 ordered sequences across three risk dimensions and three model families, we find that 38.9% exhibit measurable risk exacerbation on the originally defended dimension. These interactions are highly asymmetric and order-dependent. To explain these…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
