SafeConstellations: Mitigating Over-Refusals in LLMs Through Task-Aware Representation Steering
Utsav Maskey, Sumit Yadav, Mark Dras, Usman Naseem

TL;DR
This paper introduces SafeConstellations, a method that reduces over-refusals in LLMs by steering task-specific representations along learned trajectories, improving utility without sacrificing safety.
Contribution
The paper presents a novel inference-time approach that uses task-aware trajectory analysis to steer LLM representations away from over-refusal patterns.
Findings
SafeConstellations significantly reduces over-refusals in LLMs.
The method improves utility without sacrificing safety.
Each NLP task follows a consistent trajectory through embedding space across layers, and that trajectory shifts predictably between refusal and non-refusal cases.
Abstract
LLMs increasingly exhibit over-refusal behavior, where safety mechanisms cause models to reject benign instructions that seemingly resemble harmful content. This phenomenon diminishes utility in production applications that repeatedly rely on common prompt templates or applications that frequently rely on LLMs for specific tasks (e.g. sentiment analysis, language translation). Through extensive evaluation, we demonstrate that LLMs persist in refusing inputs containing harmful content, even when they are reframed with tasks that have benign intent. Our mechanistic analysis reveals that LLMs follow distinct "constellation" patterns in embedding space as representations traverse layers, with each NLP task maintaining consistent trajectories that shift predictably between refusal and non-refusal cases. We introduce SafeConstellations, an inference-time trajectory-shifting approach that…
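To make the idea of per-layer trajectories and trajectory shifting concrete, here is a toy sketch in plain Python. It is not the authors' implementation: the function names, the `alpha` strength parameter, and the use of a simple mean-difference shift between refusal and non-refusal centroids are all assumptions chosen for illustration, and the vectors are made-up two-dimensional stand-ins for real hidden states.

```python
from statistics import fmean

Vector = list[float]
Trajectory = list[Vector]  # one hidden-state vector per layer


def layer_centroids(trajectories: list[Trajectory]) -> Trajectory:
    """Average several per-layer trajectories into one centroid per layer."""
    num_layers = len(trajectories[0])
    dim = len(trajectories[0][0])
    return [
        [fmean(t[layer][d] for t in trajectories) for d in range(dim)]
        for layer in range(num_layers)
    ]


def steer(hidden: Vector, refusal_c: Vector, target_c: Vector, alpha: float) -> Vector:
    """Shift a hidden state along the refusal -> non-refusal direction.

    alpha is a hypothetical steering strength: 0 leaves the state unchanged.
    """
    return [h + alpha * (t - r) for h, r, t in zip(hidden, refusal_c, target_c)]


# Made-up trajectories for one task: 2 runs, 2 layers, 2 dimensions each.
refusal_runs = [[[1.0, 0.0], [2.0, 0.0]], [[1.0, 2.0], [2.0, 2.0]]]
answer_runs = [[[0.0, 1.0], [0.0, 3.0]], [[0.0, 1.0], [0.0, 5.0]]]

refusal_path = layer_centroids(refusal_runs)
answer_path = layer_centroids(answer_runs)

# Nudge a layer-1 hidden state halfway along the refusal -> answer direction.
steered = steer([2.0, 1.0], refusal_path[1], answer_path[1], alpha=0.5)
print(steered)
```

In a real model the vectors would come from the hidden states of each transformer layer, and the shift would be applied during the forward pass. How the paper selects layers, identifies the task, and sets the steering strength is not covered by the truncated abstract above.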
