EVA: Editing for Versatile Alignment against Jailbreaks
Yi Wang, Hongye Qiu, Yue Xu, Sibei Yang, Zhan Qin, Minlie Huang, Wenjie Wang

TL;DR
EVA introduces a novel model editing approach that precisely modifies specific neurons in LLMs and VLMs to improve safety against jailbreaks without degrading overall performance.
Contribution
EVA pioneers the use of direct model editing for safety alignment, targeting specific neurons to neutralize harmful behaviors efficiently.
Findings
EVA effectively mitigates jailbreaks in LLMs and VLMs.
EVA preserves the models' general reasoning capabilities.
EVA outperforms baseline methods in safety alignment tasks.
Abstract
Large Language Models (LLMs) and Vision Language Models (VLMs) have demonstrated impressive capabilities but remain vulnerable to jailbreaking attacks, where adversaries exploit textual or visual triggers to bypass safety guardrails. Recent defenses typically rely on safety fine-tuning or external filters to reduce the model's likelihood of producing harmful content. While effective to some extent, these methods often incur significant computational overhead and suffer from the safety-utility trade-off, degrading the model's performance on benign tasks. To address these challenges, we propose EVA (Editing for Versatile Alignment against Jailbreaks), a novel framework that pioneers the application of direct model editing for safety alignment. EVA reframes safety alignment as a precise knowledge correction task. Instead of retraining massive parameters, EVA identifies and surgically…
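As a generic illustration of what neuron-level editing can look like, here is a minimal sketch on a toy ReLU MLP. It is not EVA's procedure, which this summary does not spell out. The selection rule (largest mean-activation gap between harmful and benign inputs), the edit (zeroing the selected neurons' outgoing weights), and all function names are assumptions made for the example.

```python
import random


def relu_hidden(x, w_in):
    """Hidden activations of a toy one-layer ReLU MLP; w_in is [d_in][d_hidden]."""
    d_hidden = len(w_in[0])
    return [
        max(0.0, sum(xi * w_in[i][j] for i, xi in enumerate(x)))
        for j in range(d_hidden)
    ]


def mean_activation(inputs, w_in):
    """Per-neuron mean activation over a batch of inputs."""
    acts = [relu_hidden(x, w_in) for x in inputs]
    return [sum(a[j] for a in acts) / len(acts) for j in range(len(acts[0]))]


def select_neurons(harmful, benign, w_in, k):
    """Hypothetical selection rule: the k neurons with the largest
    (harmful minus benign) mean-activation gap."""
    gaps = [
        h - b
        for h, b in zip(mean_activation(harmful, w_in), mean_activation(benign, w_in))
    ]
    return sorted(range(len(gaps)), key=lambda j: gaps[j], reverse=True)[:k]


def edit_neurons(w_out, neuron_ids):
    """Return a copy of w_out ([d_hidden][d_out]) with the selected neurons'
    outgoing rows zeroed; every other row is left unchanged."""
    targets = set(neuron_ids)
    return [
        [0.0] * len(row) if j in targets else list(row)
        for j, row in enumerate(w_out)
    ]


# Toy data: random weights, with "harmful" inputs shifted away from "benign" ones.
rng = random.Random(0)
d_in, d_hidden, d_out = 6, 10, 3
w_in = [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_in)]
w_out = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(d_hidden)]
benign = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(20)]
harmful = [[rng.gauss(1, 1) for _ in range(d_in)] for _ in range(20)]

ids = select_neurons(harmful, benign, w_in, k=2)
w_out_edited = edit_neurons(w_out, ids)
```

Only the rows of `w_out` indexed by `ids` change, so the edit touches a small fraction of the parameters and requires no gradient updates. That is the general appeal of editing over fine-tuning described in the abstract.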
