You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation
Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera, Stjepan Picek, Saraga Sakthidharan

TL;DR
This paper introduces NeWTral, a neural weight translation framework that restores safety alignment in domain-specific LLM adapters without retraining, preserving their expertise while reducing safety risk.
Contribution
NeWTral is a parameter-space translation method that maps unsafe adapters to safe ones using a pre-trained non-linear translation module with mixture-of-experts (MoE) routing, requiring neither the original training data nor retraining.
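To make the idea of parameter-space translation concrete, here is a minimal sketch of a non-linear, MoE-routed map applied directly to flattened LoRA weights. It is not the paper's architecture. The class name `ToyWeightTranslator`, the helper functions, the softmax gate, the tanh experts, the residual connection, and the random (untrained) weights are all assumptions made for illustration. In the setting described above, the translator would be pre-trained on unsafe-to-safe adapter pairs before use.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def flatten_adapter(A, B):
    """Concatenate the two LoRA factors into one parameter vector."""
    return np.concatenate([A.ravel(), B.ravel()])


def unflatten_adapter(theta, a_shape, b_shape):
    """Inverse of flatten_adapter."""
    n = a_shape[0] * a_shape[1]
    return theta[:n].reshape(a_shape), theta[n:].reshape(b_shape)


class ToyWeightTranslator:
    """Hypothetical MoE translator over adapter weights (illustrative only)."""

    def __init__(self, dim, hidden=16, n_experts=4, seed=0):
        rng = np.random.default_rng(seed)
        # Random weights stand in for a pre-trained module (assumption).
        self.gate = rng.normal(0.0, 0.1, (dim, n_experts))
        self.w1 = rng.normal(0.0, 0.1, (n_experts, dim, hidden))
        self.w2 = rng.normal(0.0, 0.1, (n_experts, hidden, dim))

    def route(self, theta):
        """Soft routing probabilities over experts for one adapter."""
        return softmax(theta @ self.gate)

    def translate(self, theta):
        """Map an adapter vector to a translated one in a single forward pass."""
        probs = self.route(theta)
        delta = np.zeros_like(theta)
        for k, p in enumerate(probs):
            hidden = np.tanh(theta @ self.w1[k])   # non-linear expert
            delta += p * (hidden @ self.w2[k])     # gate-weighted mixture
        # Residual form (assumption): translated weights stay close to the input.
        return theta + delta


# Usage: translate a small rank-2 adapter without any gradient step or data.
rng = np.random.default_rng(1)
A = rng.normal(size=(2, 8))   # LoRA "down" factor, rank 2
B = rng.normal(size=(8, 2))   # LoRA "up" factor
theta = flatten_adapter(A, B)
translator = ToyWeightTranslator(dim=theta.size)
A_safe, B_safe = unflatten_adapter(translator.translate(theta), A.shape, B.shape)
```

The point of the sketch is that translation is a single forward pass over weights: no prompts, no safety dataset, and no fine-tuning loop are involved when the adapter is converted.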
Findings
Reduces attack success rate from 70% to 13%.
Maintains 90% knowledge fidelity across models.
Operates without access to original training data.
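The attack success rate figures above can be put in relative terms with a short calculation. The sketch below assumes the common definition of attack success rate as the fraction of adversarial prompts that elicit a harmful response. The function names are hypothetical, and the paper's exact evaluation protocol is not given in this summary.

```python
def attack_success_rate(outcomes):
    """Fraction of attack attempts judged successful (True = harmful response)."""
    if not outcomes:
        raise ValueError("need at least one attack outcome")
    return sum(outcomes) / len(outcomes)


def relative_reduction(before, after):
    """Relative drop in a rate, as a fraction of the starting value."""
    return (before - after) / before


# Reported figures: 70% before translation, 13% after.
print(f"{relative_reduction(0.70, 0.13):.1%}")  # prints 81.4%
```

In other words, the reported change is a drop of 57 percentage points, or roughly an 81% relative reduction in successful attacks.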
Abstract
The open-source ecosystem has accelerated the democratization of Large Language Models (LLMs) through the public distribution of specialized Low-Rank Adaptation (LoRA) modules. However, integrating these third-party adapters often induces catastrophic forgetting of the base model's foundational safety alignment. Restoring these guardrails via fine-tuning on safety data introduces an opposing failure mode: the severe degradation of the specialized domain knowledge the adapter was originally designed to provide. To overcome this zero-resource challenge, we propose Neural Weight Translation (NeWTral), a framework that directly maps unsafe, domain-specific adapters onto a safe alignment manifold while rigorously preserving their core expertise. NeWTral operates as a non-linear translation module pre-trained on a diverse corpus of unsafe-to-safe adapter pairs. By executing this mapping…
