LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment

Haonan Zhang; Dongxia Wang; Yi Liu; Kexin Chen; Wenhai Wang

arXiv:2601.19487·cs.LG·May 5, 2026

TL;DR

This paper introduces LLM-VA, a method that aligns a language model's answer vector with its benign (input-safety) vector to improve safety without sacrificing utility, addressing the jailbreak-overrefusal trade-off.

Contribution

LLM-VA is a novel vector alignment technique that causally links answer willingness to safety assessment without fine-tuning or architecture changes.

Findings

01. Achieves 11.45% higher F1 score than baselines.
02. Preserves 95.92% of model utility.
03. Automatically adapts to different safety biases.

Abstract

Safety-aligned LLMs suffer from two failure modes: jailbreak (answering harmful inputs) and over-refusal (declining benign queries). Existing vector steering methods adjust the magnitude of answer vectors, but this creates a fundamental trade-off: reducing jailbreak increases over-refusal, and vice versa. We identify the root cause: LLMs encode the decision to answer (answer vector $v_a$) and the judgment of input safety (benign vector $v_b$) as nearly orthogonal directions, treating them as independent processes. We propose LLM-VA, which aligns $v_a$ with $v_b$ through closed-form weight updates, making the model's willingness to answer causally dependent on its safety assessment, without fine-tuning or architectural changes. Our method identifies vectors at each layer using SVMs, selects safety-relevant layers, and iteratively aligns vectors via minimum-norm weight modifications.…
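
The idea of a closed-form, minimum-norm weight modification can be illustrated with a toy sketch. This is not the paper's procedure: the directions `v_a` and `v_b` below are random stand-ins (the paper fits them with SVMs), `W` is a hypothetical linear layer, and the alignment target (an input along `v_b` should also push the output along `v_a` by `alpha`) is an assumption chosen only for illustration. The sketch uses the standard result that the smallest Frobenius-norm change to `W` satisfying one linear constraint `W' x = y` is a rank-one update.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def min_norm_update(W, x, y):
    """Smallest Frobenius-norm delta with (W + delta) @ x == y (rank-one)."""
    return np.outer(y - W @ x, x) / (x @ x)

rng = np.random.default_rng(0)
d = 32

# Hypothetical stand-ins for the per-layer directions (not fitted with SVMs here).
v_b = rng.standard_normal(d)
v_b /= np.linalg.norm(v_b)
v_a = rng.standard_normal(d)
v_a -= (v_a @ v_b) * v_b              # force exact orthogonality for the toy
v_a /= np.linalg.norm(v_a)

W = rng.standard_normal((d, d)) / np.sqrt(d)   # toy linear layer (assumption)

# Assumed target: an input along v_b should also move the output along v_a.
alpha = 1.0
target = W @ v_b + alpha * v_a
delta = min_norm_update(W, v_b, target)
W_new = W + delta

# Response of the output, measured along v_a, to an input along v_b.
before = float(v_a @ (W @ v_b))
after = float(v_a @ (W_new @ v_b))
print(f"cos(v_a, v_b) = {cosine(v_a, v_b):.3f}")
print(f"answer-direction response to v_b: {before:.3f} -> {after:.3f}")
```

In this toy, the update only changes how the layer treats inputs along `v_b`; any input orthogonal to `v_b` is mapped exactly as before, which is one intuition for why a minimum-norm edit might leave most of a model's behavior intact.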

Code & Models

Repositories

https://hotbento.github.io/LLM-VA-Web
github
