Programming Refusal with Conditional Activation Steering

Bruce W. Lee; Inkit Padhi; Karthikeyan Natesan Ramamurthy; Erik Miehling; Pierre Dognin; Manish Nagireddy; Amit Dhurandhar

arXiv:2409.05907·cs.LG·February 19, 2025

TL;DR

This paper introduces Conditional Activation Steering (CAST), a method that analyzes LLM activation patterns during inference to selectively control responses based on input content, improving content moderation and domain-specific applications.

Contribution

CAST is a novel activation steering approach that enables selective response control without weight optimization by analyzing hidden state patterns during inference.

Findings

01 · CAST distinguishes prompt categories by the activation patterns they produce in the model's hidden states.

02 · CAST enables content-specific refusal without retraining or weight optimization.

03 · An open-source implementation is available at github.com/IBM/activation-steering.

Abstract

LLMs have shown remarkable capabilities, but precisely controlling their response behavior remains challenging. Existing activation steering methods alter LLM behavior indiscriminately, limiting their practical applicability in settings where selective responses are essential, such as content moderation or domain-specific assistants. In this paper, we propose Conditional Activation Steering (CAST), which analyzes LLM activation patterns during inference to selectively apply or withhold activation steering based on the input context. Our method is based on the observation that different categories of prompts activate distinct patterns in the model's hidden states. Using CAST, one can systematically control LLM behavior with rules like "if input is about hate speech or adult content, then refuse" or "if input is not about legal advice, then refuse." This allows for selective modification…
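The rule-based gating described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation or the API of the IBM library. It assumes, purely for illustration, that a condition is checked by comparing the cosine similarity between a hidden state and a condition vector against a threshold, and that steering adds a scaled behavior vector. The names `condition_met`, `conditional_steer`, `threshold`, `strength`, and `negate`, and all vectors, are hypothetical.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def condition_met(hidden: np.ndarray, condition_vec: np.ndarray,
                  threshold: float) -> bool:
    # Assumption: a condition fires when the hidden state is similar
    # enough to the condition vector.
    return cosine(hidden, condition_vec) > threshold


def conditional_steer(hidden, condition_vecs, behavior_vec,
                      threshold=0.5, strength=1.0, negate=False):
    """Add the behavior vector only when the rule is satisfied.

    negate=False: "if input is about A or B, then steer".
    negate=True:  "if input is not about A or B, then steer".
    """
    triggered = any(condition_met(hidden, c, threshold) for c in condition_vecs)
    if negate:
        triggered = not triggered
    return hidden + strength * behavior_vec if triggered else hidden


# Toy 3-dimensional "hidden states" (hypothetical values).
hate_speech = np.array([1.0, 0.0, 0.0])   # condition vector A
adult = np.array([0.0, 1.0, 0.0])         # condition vector B
refuse = np.array([0.0, 0.0, 1.0])        # behavior (refusal) vector
conds = [hate_speech, adult]

hateful = np.array([0.9, 0.1, 0.0])       # close to condition A
benign = np.array([0.1, 0.1, 0.9])        # close to neither condition

steered_hate = conditional_steer(hateful, conds, refuse)
kept_benign = conditional_steer(benign, conds, refuse)
```

With `negate=True` the same function expresses the second kind of rule from the abstract: steering is applied only to inputs that match none of the listed conditions. In a real model the check and the addition would run on selected layers' hidden states during inference, which this sketch does not attempt to reproduce.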

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 8 · Confidence 4

Strengths

* Novel Approach: CAST represents a unique advancement in activation steering by adding the ability to conditionally refuse specific categories of prompts. This method is valuable in fields like content moderation and personalized assistant behavior where indiscriminate refusal would limit utility.
* Empirical Validation: Extensive experiments demonstrate CAST’s efficacy in refusing harmful prompts without affecting benign responses across multiple LLMs. The results indicate robust behavior modi

Weaknesses

* Figure 6a -- why is conditions triggered the 'success' metric here? Shouldn't it be something like F1 score (for aggregating true positives, false positives, etc. to show performance at each data scale)?
* Why is duality of the comparison direction highlighted? Isn't it obvious that flipping the comparison direction and using a threshold of (1-c) yields the same decision boundary but flips the decision? I might be missing something here.
* The main text doesn't seem to explain well how false an

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- Introduces the concept of 'condition vectors' as a way to gate which activation vectors are triggered at each step.
- Thoroughly demonstrates the use of CAST to handle a variety of new conditional behaviors, such as more precise and robust refusal and topic-based refusal.
- Paper is mostly clearly written with well-formatted figures. The setup seems reproducible and is easy to understand.
- Helps extend the idea of activation steering to make it more robust to multiple scenarios.

In summary,

Weaknesses

These suggestions are minor as the paper itself is well done. That said, I think the paper could benefit from:

**Qualitative understanding of the condition vectors.** The main insight of the work seems to be that you can construct condition vectors in a similar way to constructing behavior vectors. But I'm still not sure how robust these condition vectors are. It would be great if the authors could run an experiment studying the generalization of these condition vectors.

- For example, you

Reviewer 03 · Rating 8 · Confidence 3

Strengths

1. The method of steering LLM conditionally on the context of the prompt is novel and an important contribution towards practical implementations of activation steering.
2. The ability to chain conditionals is an interesting contribution.
3. The paper is relatively thorough in its test of models within a certain class O(8B).

Weaknesses

1. All the tested models have at most 8B parameters. Testing on larger models would help improve the robustness of, and confidence in, the results.
2. (Minor) The harmless/harmful refusals are not tested against enough real-world inputs, like jailbreaks or multi-turn conversations.
3. (Minor) There is no limitations or future work section.

Code & Models

Repositories

ibm/activation-steering · PyTorch · Official

Datasets

Videos

Programming Refusal with Conditional Activation Steering· slideslive

Taxonomy

Topics: Software Reliability and Analysis Research · Logic, programming, and type systems · Distributed systems and fault tolerance