Steering to Say No: Configurable Refusal via Activation Steering in Vision Language Models
Jiaxi Yang, Shicheng Liu, Yuchen Yang, Dongwon Lee

TL;DR
This paper introduces CR-VLM, a novel activation steering method enabling vision language models to provide configurable refusals, improving safety and adaptability to diverse user needs through a robust, multi-component approach.
Contribution
The paper presents CR-VLM, the first method for configurable refusal in VLMs using activation steering, integrating refusal vector extraction, gating, and visual alignment modules.
Findings
CR-VLM outperforms existing refusal methods in accuracy and robustness.
It effectively balances refusal and acceptance across diverse datasets.
The approach enhances safety and user adaptability in VLMs.
Abstract
With the rapid advancement of Vision Language Models (VLMs), refusal mechanisms have become a critical component for ensuring responsible and safe model behavior. However, existing refusal strategies are largely one-size-fits-all and fail to adapt to diverse user needs and contextual constraints, leading to either under-refusal or over-refusal. In this work, we first explore these challenges and develop Configurable Refusal in VLMs (CR-VLM), a robust and efficient approach for configurable refusal based on activation steering. CR-VLM consists of three integrated components: (1) extracting a configurable refusal vector via a teacher-forced mechanism to amplify the refusal signal; (2) introducing a gating mechanism that mitigates over-refusal by preserving acceptance for in-scope queries; and (3) designing a counterfactual…
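To make the first two components concrete, here is a minimal toy sketch of gated activation steering. It is not the paper's implementation: the difference-of-means vector, the hard threshold gate, the function names (`refusal_vector`, `gate`, `steer`), and the values of `alpha` and `threshold` are all assumptions chosen for illustration, and the hidden states are hand-made 3-dimensional arrays rather than real VLM activations. The teacher-forced extraction and the third component are not modeled here.

```python
import numpy as np

def refusal_vector(refuse_acts, accept_acts):
    """Unit-length difference between mean 'refuse' and mean 'accept' activations.

    Assumption: a difference-of-means direction stands in for the paper's
    teacher-forced extraction, which the abstract does not specify in detail.
    """
    v = refuse_acts.mean(axis=0) - accept_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def gate(hidden, v, threshold):
    """Hypothetical hard gate: 1.0 if the projection of `hidden` onto `v`
    exceeds `threshold` (treated as out-of-scope), otherwise 0.0."""
    return float(hidden @ v > threshold)

def steer(hidden, v, alpha, threshold):
    """Add alpha * v to the hidden state only when the gate is open, so
    in-scope queries pass through unchanged."""
    return hidden + alpha * gate(hidden, v, threshold) * v

# Toy activations: rows are examples, columns are hidden dimensions.
refuse_acts = np.array([[2.0, 0.0, 0.0], [2.0, 1.0, 0.0]])
accept_acts = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
v = refusal_vector(refuse_acts, accept_acts)

in_scope = np.array([-1.0, 0.3, 0.2])     # low projection onto v: gate closed
out_of_scope = np.array([1.5, 0.3, 0.2])  # high projection onto v: gate open

steered_in = steer(in_scope, v, alpha=4.0, threshold=0.5)
steered_out = steer(out_of_scope, v, alpha=4.0, threshold=0.5)
print(steered_in, steered_out)
```

In this sketch the gate is what keeps steering from causing over-refusal: the in-scope state is returned untouched, while the out-of-scope state is pushed along the refusal direction. A real system would apply the same idea to a chosen layer's hidden states during generation and would likely use a learned or soft gate rather than a fixed threshold.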
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
