SafeSeek: Universal Attribution of Safety Circuits in Language Models

Miao Yu; Siyuan Fu; Moayad Aloqaily; Zhenhong Zhou; Safa Otoum; Xing fan; Kun Wang; Yufei Guo; Qingsong Wen

arXiv:2603.23268·cs.LG·March 25, 2026

Open Access

TL;DR

SafeSeek introduces a unified, optimization-based framework for identifying safety-critical circuits in large language models, enabling reliable safety attribution and effective fine-tuning to mitigate risks.

Contribution

It proposes a differentiable masking method to extract complete safety circuits in LLMs, surpassing heuristic approaches and facilitating safety fine-tuning.

Findings

01. Identified a backdoor circuit with 0.42% sparsity that drastically reduces attack success.

02. Localized an alignment circuit with 3.03% heads and 0.79% neurons, affecting safety and utility.

03. Demonstrated effective safety fine-tuning by removing identified circuits without utility loss.

Abstract

Mechanistic interpretability reveals that safety-critical behaviors (e.g., alignment, jailbreak, backdoor) in Large Language Models (LLMs) are grounded in specialized functional components. However, existing safety attribution methods struggle with generalization and reliability due to their reliance on heuristic, domain-specific metrics and search algorithms. To address this, we propose SafeSeek, a unified safety interpretability framework that identifies functionally complete safety circuits in LLMs via optimization. Unlike methods focusing on isolated heads or neurons, SafeSeek introduces differentiable binary masks to extract multi-granular circuits through gradient descent on safety datasets, and integrates Safety Circuit Tuning to utilize these sparse circuits for efficient safety fine-tuning. We validate SafeSeek in two key scenarios in LLM safety: (1) backdoor…
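The mask-learning idea in the abstract can be illustrated on a toy problem. The sketch below is not the SafeSeek implementation: it assumes a sigmoid relaxation of the binary mask, a squared-error faithfulness loss against the unmasked model, and an L1-style sparsity penalty, none of which are confirmed by the text above. The "model" is a hypothetical eight-unit linear map standing in for heads or neurons, and names such as `learn_mask` and `sparsity_weight` are illustrative only.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def full_output(weights, x):
    # Output of the unmasked toy model: every unit contributes.
    return sum(w * xi for w, xi in zip(weights, x))


def masked_output(weights, gates, x):
    # Each unit's contribution is scaled by its gate in [0, 1].
    return sum(g * w * xi for g, w, xi in zip(gates, weights, x))


def learn_mask(weights, data, sparsity_weight=0.05, lr=1.0, steps=300):
    """Learn soft gates by gradient descent (assumed sigmoid relaxation)."""
    n = len(weights)
    logits = [0.0] * n  # every gate starts at 0.5
    for _ in range(steps):
        gates = [sigmoid(t) for t in logits]
        grad_gate = [0.0] * n
        for x in data:
            # Faithfulness error: masked model versus full model.
            err = masked_output(weights, gates, x) - full_output(weights, x)
            for i in range(n):
                grad_gate[i] += 2.0 * err * weights[i] * x[i] / len(data)
        for i in range(n):
            g = gates[i]
            # Chain rule through the sigmoid: d gate / d logit = g * (1 - g).
            # The sparsity penalty adds a constant pull toward closing the gate.
            logits[i] -= lr * (grad_gate[i] + sparsity_weight) * g * (1.0 - g)
    return [sigmoid(t) for t in logits]


# Hypothetical toy model: only units 0 and 3 carry the behavior of interest.
rng = random.Random(0)
weights = [2.0, 0.01, -0.02, -1.5, 0.0, 0.01, 0.0, 0.02]
data = [[rng.uniform(-1.0, 1.0) for _ in weights] for _ in range(64)]

gates = learn_mask(weights, data)
circuit = [i for i, g in enumerate(gates) if g > 0.5]  # binarize the mask
sparsity = len(circuit) / len(weights)
print("circuit units:", circuit, "sparsity:", sparsity)
```

In this toy setting the penalty closes the gates of units that barely affect the output, while the faithfulness term keeps the gates of the influential units open. A real transformer would need masks over attention heads and MLP neurons and an automatic-differentiation framework rather than hand-written gradients.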

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Topic Modeling