Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

Xiangtao Meng; Wenyu Chen; Chuanchao Zang; Xinyu Gao; Jianing Wang; Li Wang; Zheng Li; Shanqing Guo

arXiv:2605.14514·cs.CR·May 15, 2026


TL;DR

This paper systematically studies how defenses applied sequentially to a large language model interact. It shows that a later defense can conflict with an earlier one, and it proposes a layer-freezing mitigation that preserves existing protections during incremental deployment.

Contribution

The paper presents the first systematic analysis of cross-defense interactions under sequential LLM deployment, and it introduces a conflict score and a mitigation strategy for defense conflicts.

Findings

01. 38.9% of the 144 evaluated defense sequences show measurable risk exacerbation on the originally defended dimension

02. Conflicting defenses localize to critical layers with anti-aligned updates

03. Layer freezing mitigates conflicts without harming secondary defenses
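
The second and third findings can be illustrated with a small sketch. It is not the paper's conflict score or its mitigation procedure, neither of which is specified on this page. It assumes that each defense's effect on a layer can be summarized as a flat update vector, that "anti-aligned" means a negative cosine similarity between the two defenses' updates, and that layers below a chosen threshold are the ones to freeze. The names `layer_alignment`, `layers_to_freeze`, and the toy layer names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def layer_alignment(delta_first, delta_second):
    """Per-layer cosine similarity between two defenses' weight updates.

    Each argument maps a layer name to a flat list of update values
    (weights after the defense minus weights before it). This is an
    assumed proxy for alignment, not the paper's metric.
    """
    return {
        name: cosine(delta_first[name], delta_second[name])
        for name in delta_first
        if name in delta_second
    }


def layers_to_freeze(alignment, threshold=0.0):
    """Names of layers whose updates point in opposing directions (assumed criterion)."""
    return sorted(name for name, score in alignment.items() if score < threshold)


# Toy updates for three hypothetical layers.
delta_a = {
    "layer.0": [0.1, 0.2, 0.0],
    "layer.1": [0.5, -0.3, 0.2],
    "layer.2": [0.0, 0.1, 0.1],
}
delta_b = {
    "layer.0": [0.2, 0.1, 0.1],
    "layer.1": [-0.4, 0.3, -0.1],
    "layer.2": [0.1, 0.0, 0.2],
}

alignment = layer_alignment(delta_a, delta_b)
frozen = layers_to_freeze(alignment)
print(frozen)
```

In this toy data only `layer.1` has opposing updates, so it is the only layer selected. In a real training setup, freezing would mean excluding the selected layers' parameters from the second defense's update, for example by disabling gradients for them.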

Abstract

Large Language Models (LLMs) deployed in high-stakes applications must simultaneously manage multiple risks, yet existing defenses are almost exclusively evaluated in isolation under a one-shot deployment assumption. In practice, providers patch models incrementally throughout their lifecycle, responding to newly exposed vulnerabilities or targeted data-removal requests without retraining from scratch. This raises a fundamental but underexplored question: does a later defense preserve the protections established by an earlier one? We present the first systematic study of cross-defense interactions under sequential deployment. Evaluating 144 ordered sequences across three risk dimensions and three model families, we find that 38.9% exhibit measurable risk exacerbation on the originally defended dimension. These interactions are highly asymmetric and order-dependent. To explain these…
