Steering to Say No: Configurable Refusal via Activation Steering in Vision Language Models

Jiaxi Yang; Shicheng Liu; Yuchen Yang; Dongwon Lee

arXiv:2602.07013·cs.CV·February 10, 2026

Open Access

TL;DR

This paper introduces CR-VLM, an activation steering method that lets vision language models refuse in a configurable way. It targets both under-refusal and over-refusal by combining refusal vector extraction, a gating mechanism, and a visual alignment module.

Contribution

The paper presents CR-VLM, the first method for configurable refusal in VLMs using activation steering, integrating refusal vector extraction, gating, and visual alignment modules.

Findings

01 CR-VLM outperforms existing refusal methods in accuracy and robustness.

02 It effectively balances refusal and acceptance across diverse datasets.

03 The approach enhances safety and user adaptability in VLMs.

Abstract

With the rapid advancement of Vision Language Models (VLMs), refusal mechanisms have become a critical component for ensuring responsible and safe model behavior. However, existing refusal strategies are largely one-size-fits-all and fail to adapt to diverse user needs and contextual constraints, leading to either under-refusal or over-refusal. In this work, we first explore the challenges mentioned above and develop Configurable Refusal in VLMs (CR-VLM), a robust and efficient approach for configurable refusal based on activation steering. CR-VLM consists of three integrated components: (1) extracting a configurable refusal vector via a teacher-forced mechanism to amplify the refusal signal; (2) introducing a gating mechanism that mitigates over-refusal by preserving acceptance for in-scope queries; and (3) designing a counterfactual…
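As a rough illustration of the general idea of gated activation steering, the sketch below builds a steering direction from two sets of hidden activations and adds it only when a gate fires. It is not the authors' implementation: the difference-of-means extraction, the thresholded-projection gate, the function names (`refusal_vector`, `scope_gate`, `steer`), the parameters `alpha` and `threshold`, and the synthetic activations are all assumptions made for illustration. The teacher-forced extraction and the third component named in the abstract are not modeled.

```python
import numpy as np


def refusal_vector(refuse_acts, accept_acts):
    """Unit-norm difference of mean activations (an assumed extraction rule)."""
    v = refuse_acts.mean(axis=0) - accept_acts.mean(axis=0)
    return v / np.linalg.norm(v)


def scope_gate(h, probe, threshold=0.0):
    """Hypothetical hard gate: 1.0 if h projects onto probe above threshold, else 0.0."""
    return 1.0 if float(h @ probe) > threshold else 0.0


def steer(h, v, gate_value, alpha=4.0):
    """Shift activation h along v by alpha, scaled by the gate value."""
    return h + gate_value * alpha * v


# Synthetic activations: two clusters separated along the first coordinate.
rng = np.random.default_rng(0)
dim = 16
axis = np.zeros(dim)
axis[0] = 1.0
refuse_acts = 0.1 * rng.normal(size=(32, dim)) + 2.0 * axis  # queries to refuse
accept_acts = 0.1 * rng.normal(size=(32, dim)) - 2.0 * axis  # in-scope queries

v = refusal_vector(refuse_acts, accept_acts)

# For simplicity the gate reuses v as its probe; a real gate could be a separate model.
h_in_scope = accept_acts[0]
h_out_of_scope = refuse_acts[0]

kept = steer(h_in_scope, v, scope_gate(h_in_scope, v))
pushed = steer(h_out_of_scope, v, scope_gate(h_out_of_scope, v))

print("in-scope shift along v:    ", float((kept - h_in_scope) @ v))
print("out-of-scope shift along v:", float((pushed - h_out_of_scope) @ v))
```

In this toy setup the in-scope activation passes through unchanged because the gate stays closed, while the out-of-scope activation is moved along `v` by `alpha`. That mirrors, in a very simplified form, the abstract's stated goal of preserving acceptance for in-scope queries.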

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques