Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

Jyotin Goel; Souvik Maji; Pratik Mazumder

arXiv:2602.17546 · cs.CL · May 12, 2026

TL;DR

This paper presents an adaptive regularization framework for fine-tuning language models. It preserves both safety and utility by estimating safety risk during training and keeping higher-risk updates close to a safe reference policy.

Contribution

It introduces two novel methods for estimating safety risk, one judge-based and one activation-based, that enable models to stay aligned during fine-tuning without inference-time costs.
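To make the activation-based idea concrete, here is a minimal sketch of a lightweight classifier (a logistic-regression probe) trained on activation vectors. It is an illustration only, not the authors' implementation: the function names `train_probe` and `predict_risk`, the two-dimensional toy activations, the labels, and the hyperparameters are all hypothetical, and a real probe would be fit on intermediate activations extracted from the model.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_probe(activations, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with plain batch gradient descent.

    activations: list of equal-length float vectors (hypothetical stand-ins
                 for intermediate model activations).
    labels:      1 for harmful intent, 0 for benign (hypothetical labels).
    """
    dim = len(activations[0])
    n = len(activations)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(activations, labels):
            # Gradient of the log loss with respect to the logit is (p - y).
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_risk(w, b, activation):
    # Risk score in (0, 1): the probe's estimated probability of harmful intent.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, activation)) + b)


# Toy data (assumption): the first coordinate separates the two classes.
toy_activations = [[1.0, 0.2], [0.9, -0.1], [-1.0, 0.1], [-0.8, -0.3]]
toy_labels = [1, 1, 0, 0]
w, b = train_probe(toy_activations, toy_labels)
```

Calling `predict_risk(w, b, activation)` on a new activation vector yields a score that could serve as the training-time risk signal. Because the probe is only consulted during fine-tuning, it adds nothing to inference.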

Findings

1. Adaptive regularization reduces attack success rate across models.
2. Safety risk signals are predictable from model activations.
3. The approach preserves downstream performance.

Abstract

Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning. To estimate safety risk at training time, we explore two distinct approaches: a judge-based Safety Critic that assigns high-level harm scores to training batches, and an activation-based risk predictor built with a lightweight classifier trained on intermediate model activations to estimate harmful intent. Each approach provides a risk signal that is used to constrain updates deemed higher risk to remain close to a safe reference policy, while lower-risk updates proceed…
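The mechanism the abstract describes, where a risk signal decides how tightly an update is held to a safe reference policy, can be sketched as a risk-weighted penalty added to the task loss. This is a sketch under stated assumptions rather than the paper's objective: the KL divergence as the closeness measure, its direction (policy to reference), the linear scaling by risk, and the names `adaptive_loss` and `lam_max` are all choices made here for illustration.

```python
import math


def softmax(logits):
    # Subtract the max logit for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def adaptive_loss(task_loss, policy_logits, reference_logits, risk, lam_max=1.0):
    """Task loss plus a risk-scaled penalty toward the reference policy.

    risk:    score in [0, 1] from a risk estimator (assumed interface).
    lam_max: hypothetical cap on the regularization strength.
    """
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must lie in [0, 1]")
    policy = softmax(policy_logits)
    reference = softmax(reference_logits)
    # Higher risk means a stronger pull toward the safe reference;
    # risk = 0 leaves the task loss unchanged.
    return task_loss + lam_max * risk * kl_divergence(policy, reference)
```

With `risk=0.0` the function returns the task loss untouched, so low-risk updates proceed as in ordinary fine-tuning. As `risk` approaches 1.0, the penalty grows in proportion to how far the policy has drifted from the reference.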
