TL;DR
This paper introduces LLM-VA, a method that aligns answer and safety vectors in language models to improve safety without sacrificing utility, addressing the jailbreak-over-refusal trade-off.
Contribution
LLM-VA is a novel vector alignment technique that causally links answer willingness to safety assessment without fine-tuning or architecture changes.
Findings
Achieves 11.45% higher F1 score than baselines.
Preserves 95.92% of model utility.
Automatically adapts to different safety biases.
Abstract
Safety-aligned LLMs suffer from two failure modes: jailbreak (answering harmful inputs) and over-refusal (declining benign queries). Existing vector steering methods adjust the magnitude of answer vectors, but this creates a fundamental trade-off: reducing jailbreak increases over-refusal, and vice versa. We identify the root cause: LLMs encode the decision to answer (the answer vector) and the judgment of input safety (the benign vector) as nearly orthogonal directions, treating them as independent processes. We propose LLM-VA, which aligns the answer vector with the benign vector through closed-form weight updates, making the model's willingness to answer causally dependent on its safety assessment, without fine-tuning or architectural changes. Our method identifies vectors at each layer using SVMs, selects safety-relevant layers, and iteratively aligns vectors via minimum-norm weight modifications.…
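To make the idea of a closed-form, minimum-norm weight modification concrete, here is a toy sketch. It is not the paper's procedure: the 2-D vectors, the identity weight matrix, and the helper names (`cosine`, `min_norm_update`) are all hypothetical, and the choice of which weights to edit and what target to use is an assumption. It relies only on a standard linear-algebra fact: the smallest Frobenius-norm change to `W` that enforces `W' x = target` is the rank-one update `(target - W x) x^T / (x^T x)`.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    # Cosine similarity; a value near 0 means the directions are nearly orthogonal.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def matvec(W, x):
    return [dot(row, x) for row in W]

def min_norm_update(W, x, target):
    # Rank-one update W' = W + (target - W x) x^T / (x^T x).
    # Among all W' with W' x = target, this one changes W the least
    # in Frobenius norm.
    residual = [t - y for t, y in zip(target, matvec(W, x))]
    scale = dot(x, x)
    return [[w + r * xj / scale for w, xj in zip(row, x)]
            for row, r in zip(W, residual)]

# Hypothetical toy setup: orthogonal "answer" and "benign" directions,
# and an identity map standing in for a layer's weights.
answer_vec = [1.0, 0.0]
benign_vec = [0.0, 1.0]
W = [[1.0, 0.0], [0.0, 1.0]]

before = cosine(matvec(W, answer_vec), benign_vec)
W_new = min_norm_update(W, answer_vec, benign_vec)
after = cosine(matvec(W_new, answer_vec), benign_vec)
```

In this toy case the two directions start orthogonal, and after the update the edited map sends the answer direction exactly onto the benign direction. A real model would need the per-layer vector identification and layer selection that the abstract describes, which this sketch does not attempt.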
