Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning
Mahavir Dabas, Si Chen, Charles Fleming, Ming Jin, and Ruoxi Jia

TL;DR
This paper presents ACTOR, a targeted fine-tuning method that reduces over-refusals in aligned language models by adjusting internal activation patterns, improving user experience without sacrificing safety or utility.
Contribution
Introduces ACTOR, a compute- and data-efficient fine-tuning framework that mitigates over-refusals by precisely adjusting internal activations in LLMs.
Findings
Significantly reduces over-refusals across multiple benchmarks.
Maintains the model's ability to handle harmful queries.
Preserves overall utility of the language model.
Abstract
Safety alignment is crucial for large language models (LLMs) to resist malicious instructions but often results in over-refusals, where benign prompts are unnecessarily rejected, impairing user experience and model utility. We introduce ACTOR (Activation-Based Training for Over-Refusal Reduction), a robust and compute- and data-efficient training framework that minimizes over-refusals by leveraging internal activation patterns from diverse queries. ACTOR precisely identifies and adjusts the activation components that trigger refusals, providing stronger control over the refusal mechanism. By fine-tuning only a single model layer, ACTOR effectively reduces over-refusals across multiple benchmarks while maintaining the model's ability to handle harmful queries and preserve overall utility.
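The abstract does not spell out how the refusal-triggering activation components are identified or adjusted. As a rough illustration only, the sketch below uses one common technique from the activation-steering literature: estimate a "refusal direction" as the normalized difference between mean activations on harmful and benign prompts, then shift a benign-but-refused query's activation by exactly its projection onto that direction. The difference-of-means choice, the function names, and the toy two-dimensional activations are all assumptions made for illustration; they are not taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Column-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def refusal_direction(harmful_acts: Sequence[Sequence[float]],
                      benign_acts: Sequence[Sequence[float]]) -> Vector:
    """Unit vector from the benign mean to the harmful mean.

    ASSUMPTION: difference-of-means is a stand-in for whatever
    component-identification step the paper actually uses.
    """
    mu_h = mean_vector(harmful_acts)
    mu_b = mean_vector(benign_acts)
    diff = [h - b for h, b in zip(mu_h, mu_b)]
    norm = dot(diff, diff) ** 0.5
    if norm == 0.0:
        raise ValueError("harmful and benign means coincide; no direction")
    return [d / norm for d in diff]


def shift_along(activation: Sequence[float],
                direction: Sequence[float],
                amount: float) -> Vector:
    """Move an activation by `amount` along a unit direction."""
    return [a + amount * d for a, d in zip(activation, direction)]


def remove_refusal_component(activation: Sequence[float],
                             direction: Sequence[float]) -> Vector:
    """Shift by the negative projection, so the result has no
    component along `direction` and is otherwise unchanged."""
    return shift_along(activation, direction, -dot(activation, direction))


# Toy, made-up activations at a single hypothetical layer (2 dimensions).
harmful = [[2.0, 0.0], [4.0, 0.0]]
benign = [[0.0, 1.0], [0.0, -1.0]]
direction = refusal_direction(harmful, benign)

# A benign query that nevertheless carries some refusal component.
over_refused = [1.5, 0.7]
target = remove_refusal_component(over_refused, direction)
```

In a fine-tuning setting, a vector like `target` could serve as a regression target for the activations of the one layer being trained, while harmful queries keep their original activations as targets. That pairing is likewise an assumption about how such a scheme might be wired up, not a description of ACTOR's actual loss.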
