Safety-Utility Conflicts Are Not Global: Surgical Alignment via Head-Level Diagnosis
Wang Cai, Yilin Wen, Jinchang Hou, Du Su, Guoqiu Wang, Zhonghou Lv, Chenfu Bao, Yunfang Wu

TL;DR
This paper introduces Conflict-Aware Sparse Tuning (CAST), a head-level diagnosis method for LLM safety alignment that selectively updates parameters, reducing safety-utility conflicts and preserving capabilities.
Contribution
The paper proposes a framework that diagnoses conflict at the level of individual attention heads in Transformers and updates heads selectively, addressing a limitation of global gradient-based methods, which apply uniform update rules across all parameters.
Findings
Alignment conflicts are unevenly distributed across heads.
Skipping high-conflict heads preserves capabilities while improving safety.
Selective head tuning enhances safety-utility trade-offs.
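To make the findings above concrete, here is a minimal sketch of one way a head-level conflict diagnosis and a sparse update could look. It is not the authors' implementation: the conflict score (negative cosine similarity between per-head safety and utility gradients), the `skip_fraction` parameter, and all function names are assumptions chosen for illustration, and the gradients are toy arrays rather than values from a real model.

```python
import numpy as np

def head_conflict_scores(g_safety, g_utility, eps=1e-12):
    """Hypothetical per-head conflict score in [0, 1].

    Inputs have shape (num_heads, params_per_head). The score is 0 when the
    two gradients are aligned or orthogonal and approaches 1 when they point
    in opposite directions. This metric is an assumption, not the paper's.
    """
    dot = np.sum(g_safety * g_utility, axis=1)
    norms = np.linalg.norm(g_safety, axis=1) * np.linalg.norm(g_utility, axis=1)
    cosine = dot / (norms + eps)
    return np.maximum(0.0, -cosine)

def sparse_update_mask(scores, skip_fraction=0.25):
    """Boolean mask: True for heads to update, False for skipped heads."""
    num_skip = int(round(skip_fraction * len(scores)))
    mask = np.ones(len(scores), dtype=bool)
    if num_skip > 0:
        # Skip the heads with the highest conflict scores.
        mask[np.argsort(scores)[-num_skip:]] = False
    return mask

def masked_safety_step(weights, g_safety, mask, lr=0.1):
    """One gradient step on the safety objective, applied only to unmasked heads."""
    return weights - lr * g_safety * mask[:, None]

# Toy example: 4 heads with 2 parameters each.
g_safety = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
g_utility = np.array([[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [-1.0, -1.0]])

scores = head_conflict_scores(g_safety, g_utility)
mask = sparse_update_mask(scores, skip_fraction=0.5)
new_weights = masked_safety_step(np.zeros((4, 2)), g_safety, mask)
```

In this toy setup, heads 1 and 3 have opposing safety and utility gradients, so they are the ones left untouched by the update, while heads 0 and 2 receive the ordinary safety step. A global method would instead apply one rule to all four heads.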
Abstract
Safety alignment in Large Language Models (LLMs) inherently presents a multi-objective optimization conflict, often accompanied by an unintended degradation of general capabilities. Existing mitigation strategies typically rely on global gradient geometry to resolve these conflicts, yet they overlook Modular Heterogeneity within Transformers, specifically that the functional sensitivity and degree of conflict vary substantially across different attention heads. Such global approaches impose uniform update rules across all parameters, often resulting in suboptimal trade-offs by indiscriminately updating utility-sensitive heads that exhibit intense gradient conflicts. To address this limitation, we propose Conflict-Aware Sparse Tuning (CAST), a framework that integrates head-level diagnosis with sparse fine-tuning. CAST first constructs a pre-alignment conflict map by synthesizing…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Safety Systems Engineering in Autonomy · Explainable Artificial Intelligence (XAI)
