C-ΔΘ: Circuit-Restricted Weight Arithmetic for Selective Refusal
Aditya Kasliwal, Pratinav Seth, Vinay Kumar Sankarapu

TL;DR
This paper introduces C-Δθ, an offline, circuit-restricted weight update that gives a language model selective refusal without inference-time interventions, removing the recurring runtime cost of steering hooks.
Contribution
C-Δθ localizes refusal-causal computation to a sparse circuit using EAP-IG, then computes a constrained weight update supported only on that circuit, so the edited model deploys as a standard checkpoint.
Findings
Effective category-specific refusal achieved
Significant reduction in inference-time overhead
Maintains model utility after updates
Abstract
Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks and scales cost with the number of generations; conditional variants improve selectivity by gating when steering is applied but still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose C-Δθ: Circuit-Restricted Weight Arithmetic, which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update Δθ_C supported only on that circuit (typically <5% of…
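The idea of a weight update "supported only on that circuit" can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated before it says how the update is computed, so the per-parameter `scores` (standing in for EAP-IG attributions), the candidate update `delta`, the `keep_fraction` threshold, and the function name `circuit_restricted_update` are all hypothetical. The sketch only shows the masking step: changes are zeroed everywhere outside the selected support, and the result is an ordinary weight array that needs no runtime hook.

```python
import numpy as np

def circuit_restricted_update(theta, delta, scores, keep_fraction=0.05, alpha=1.0):
    """Apply `delta` only on the top-scoring entries of `theta`.

    Assumption: `scores` is a per-parameter importance score (a stand-in
    for circuit attributions) and `delta` is some candidate update;
    neither is specified by the source.
    """
    k = max(1, int(round(keep_fraction * scores.size)))
    flat = np.abs(scores).ravel()
    # Indices of the k largest-magnitude scores (the hypothetical "circuit").
    top = np.argpartition(flat, -k)[-k:]
    mask = np.zeros(flat.shape, dtype=bool)
    mask[top] = True
    mask = mask.reshape(theta.shape)
    # Masked update: entries outside the circuit are left exactly unchanged.
    new_theta = theta + alpha * np.where(mask, delta, 0.0)
    return new_theta, mask

# Toy data: a 20x10 "weight matrix" with random scores and a random update.
rng = np.random.default_rng(0)
theta = rng.normal(size=(20, 10))
delta = rng.normal(size=(20, 10))
scores = rng.normal(size=(20, 10))

new_theta, mask = circuit_restricted_update(theta, delta, scores, keep_fraction=0.05)
print("fraction of weights touched:", mask.mean())
```

Because the edit is applied once, offline, the cost does not grow with the number of generations, which is the contrast the abstract draws with activation steering.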
Taxonomy
Topics: Formal Methods in Verification · Distributed systems and fault tolerance · Security and Verification in Computing
