You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation

Marco Arazzi; Vignesh Kumar Kembu; Antonino Nocera; Stjepan Picek; Saraga Sakthidharan

arXiv:2605.04992 · cs.CR · May 7, 2026

TL;DR

This paper introduces NeWTral, a neural weight translation framework that restores safety alignment in domain-specific LoRA adapters for LLMs without retraining, preserving the adapter's domain expertise while reducing safety risks.

Contribution

NeWTral is a novel parameter-space translation method that maps unsafe adapters to safe ones using a pre-trained non-linear module and MoE routing, requiring neither the original training data nor retraining.

Findings

01. Reduces attack success rate from 70% to 13%.

02. Maintains 90% knowledge fidelity across models.

03. Operates without access to original training data.

Abstract

The open-source ecosystem has accelerated the democratization of Large Language Models (LLMs) through the public distribution of specialized Low-Rank Adaptation (LoRA) modules. However, integrating these third-party adapters often induces catastrophic forgetting of the base model's foundational safety alignment. Restoring these guardrails via fine-tuning on safety data introduces an opposing failure mode: the severe degradation of the specialized domain knowledge the adapter was originally designed to provide. To overcome this zero-resource challenge, we propose Neural Weight Translation (NeWTral), a framework that directly maps unsafe, domain-specific adapters onto a safe alignment manifold while rigorously preserving their core expertise. NeWTral operates as a non-linear translation module pre-trained on a diverse corpus of unsafe-to-safe adapter pairs. By executing this mapping…
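
To make the idea of translating adapters directly in parameter space more concrete, here is a minimal sketch. It is not the paper's implementation. It assumes a LoRA adapter is a pair of small matrices `A` and `B`, flattens them into one vector, and passes that vector through a softly gated mixture of small non-linear experts with a residual connection. The names `ToyWeightTranslator` and `translate_adapter`, the layer sizes, the soft gating, and the residual form are all hypothetical choices made for illustration. The weights are random rather than pre-trained on unsafe-to-safe adapter pairs, so the output has no safety properties; the sketch only shows the data flow.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rng, rows, cols, scale=0.1):
    """rows x cols matrix of Gaussian entries, as a list of lists."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(vec, mat):
    """Row vector (length = rows of mat) times matrix -> vector of length cols."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

class ToyWeightTranslator:
    """Hypothetical stand-in for a weight translation module (untrained)."""

    def __init__(self, dim, hidden=8, n_experts=3, seed=0):
        rng = random.Random(seed)
        self.gate = rand_matrix(rng, dim, n_experts)  # router: dim -> expert scores
        self.experts = [
            (rand_matrix(rng, dim, hidden), rand_matrix(rng, hidden, dim))
            for _ in range(n_experts)
        ]

    def route(self, flat):
        """Soft routing weights over experts (assumed; they sum to 1)."""
        return softmax(matvec(flat, self.gate))

    def translate(self, flat):
        probs = self.route(flat)
        out = list(flat)  # residual: start from the original adapter weights
        for p, (w1, w2) in zip(probs, self.experts):
            hidden = [math.tanh(x) for x in matvec(flat, w1)]  # non-linear expert
            delta = matvec(hidden, w2)
            out = [o + p * d for o, d in zip(out, delta)]
        return out

def translate_adapter(A, B, translator):
    """Flatten the LoRA factors, translate them, and restore the original shapes."""
    flat = [x for row in A for x in row] + [x for row in B for x in row]
    out = translator.translate(flat)
    n_a, cols_a, cols_b = len(A) * len(A[0]), len(A[0]), len(B[0])
    a_flat, b_flat = out[:n_a], out[n_a:]
    A_new = [a_flat[i:i + cols_a] for i in range(0, len(a_flat), cols_a)]
    B_new = [b_flat[i:i + cols_b] for i in range(0, len(b_flat), cols_b)]
    return A_new, B_new

# Toy adapter: rank 2, input dim 4, output dim 4 (sizes chosen arbitrarily).
rng = random.Random(1)
A = rand_matrix(rng, 2, 4, scale=1.0)
B = rand_matrix(rng, 4, 2, scale=1.0)
translator = ToyWeightTranslator(dim=16)
A_out, B_out = translate_adapter(A, B, translator)
print(len(A_out), len(A_out[0]), len(B_out), len(B_out[0]))  # 2 4 4 2
```

The point of the sketch is that the whole operation is a forward pass over adapter weights: no training data and no gradient steps on the adapter are involved at translation time, which is consistent with the "without retraining" framing above. How the actual module is structured and trained is not specified in the text available here.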
