Feature-Guided SAE Steering for Refusal-Rate Control using Contrasting Prompts
Samaksh Bhargav, Zining Zhu

TL;DR
This paper introduces a feature-guided SAE steering method that selects steering features systematically using contrasting prompts, improving both LLM safety and utility rather than trading one against the other.
Contribution
It presents a novel feature selection approach with contrasting prompts for SAE steering, enhancing safety and utility in LLMs without weight adjustments.
Findings
18.9% improvement in safety performance
11.1% increase in utility
Effective feature selection with contrasting prompts
Abstract
Large Language Model (LLM) deployment requires guiding the LLM to recognize and refuse unsafe prompts while complying with safe ones. Previous methods for achieving this require adjusting model weights, along with other expensive procedures. While recent advances in Sparse Autoencoders (SAEs) have enabled interpretable feature extraction from LLMs, existing approaches lack systematic feature selection methods and principled evaluation of safety-utility tradeoffs. To address this, we explore steering with different SAE features and steering strengths. We use a contrasting-prompt method, built on the AI-Generated Prompts Dataset from teknium/OpenHermes-2p5-Mistral-7B and the Air Bench eu-dataset, to efficiently choose the best features in the model to steer, and we test this method on Llama-3 8B. We conclude that using this method,…
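The abstract does not spell out how contrasting prompts pick features, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes SAE feature activations are already available as arrays (one row per prompt), ranks features by the difference in mean activation between unsafe and safe prompts, and steers by adding a scaled decoder direction to a hidden state. The function names, the mean-difference score, the additive steering rule, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def select_contrast_features(unsafe_acts, safe_acts, top_k=5):
    """Rank SAE features by mean activation gap (unsafe minus safe).

    Hypothetical scoring rule: the source does not state its exact criterion.
    Both inputs have shape (n_prompts, n_features).
    """
    diff = unsafe_acts.mean(axis=0) - safe_acts.mean(axis=0)
    top = np.argsort(diff)[::-1][:top_k]  # indices of largest gaps first
    return top, diff[top]


def steer(hidden, decoder, feature_idx, strength):
    """Add a unit-normalized SAE decoder direction, scaled by `strength`.

    Assumed additive steering; `decoder` has shape (n_features, d_model).
    """
    direction = decoder[feature_idx]
    direction = direction / np.linalg.norm(direction)
    return hidden + strength * direction


# Toy data standing in for real SAE activations (purely synthetic).
rng = np.random.default_rng(0)
n_features, d_model = 16, 8
safe_acts = rng.random((20, n_features)) * 0.1
unsafe_acts = rng.random((20, n_features)) * 0.1
unsafe_acts[:, 3] += 1.0  # pretend feature 3 fires mostly on unsafe prompts

decoder = rng.normal(size=(n_features, d_model))
top, scores = select_contrast_features(unsafe_acts, safe_acts, top_k=3)

hidden = np.zeros((4, d_model))  # placeholder residual-stream states
steered = steer(hidden, decoder, top[0], strength=2.0)
```

In this toy setup the feature that was made to fire on the synthetic "unsafe" rows ranks first, and `strength` plays the role of the steering strength that the abstract says was varied. A real pipeline would read activations from an SAE attached to the model and apply the shift inside the forward pass.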
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Safety Systems Engineering in Autonomy
