Curvature-Aware Safety Restoration In LLMs Fine-Tuning

Thong Bach; Thanh Nguyen-Tang; Dung Nguyen; Thao Minh Le; Truyen Tran

arXiv:2511.18039·cs.LG·November 25, 2025

TL;DR

This paper introduces a curvature-aware method that restores safety alignment in fine-tuned LLMs by exploiting loss landscape geometry, reducing harmful outputs without sacrificing task performance.

Contribution

The paper shows that fine-tuned LLMs preserve the geometric structure of their loss landscapes on harmful content, regardless of the fine-tuning method, and proposes a curvature-aware alignment restoration technique based on influence functions and second-order optimization.

Findings

01. Reduces harmful responses across multiple models and settings.

02. Maintains or improves task performance and few-shot learning.

03. Efficiently balances safety and utility in LLM fine-tuning.

Abstract

Fine-tuning Large Language Models (LLMs) for downstream tasks often compromises safety alignment, even when using parameter-efficient methods like LoRA. In this work, we uncover a notable property: fine-tuned models preserve the geometric structure of their loss landscapes concerning harmful content, regardless of the fine-tuning method employed. This suggests that safety behaviors are not erased but shifted to less influential regions of the parameter space. Building on this insight, we propose a curvature-aware alignment restoration method that leverages influence functions and second-order optimization to selectively increase loss on harmful inputs while preserving task performance. By navigating the shared geometry between base and fine-tuned models, our method discourages unsafe outputs while preserving task-relevant performance, avoiding full reversion and enabling precise,…
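
The idea of raising loss on harmful inputs while leaving task performance roughly intact can be illustrated with a toy example. The sketch below is not the paper's algorithm: it assumes two hand-made quadratic losses, a diagonal curvature approximation, and hypothetical names (`restoration_direction`, `task_curv`, `harm_curv`, `damping`). It takes a curvature-preconditioned ascent step on the harmful loss and projects out the component that would change the task loss to first order.

```python
# Toy illustration only: quadratic losses and a diagonal Hessian stand in
# for an LLM's task loss, harmful-content loss, and curvature estimate.

def loss(curv, center, theta):
    """Quadratic loss 0.5 * sum(c_i * (theta_i - center_i)^2)."""
    return 0.5 * sum(c * (t - m) ** 2 for c, m, t in zip(curv, center, theta))

def grad(curv, center, theta):
    """Gradient of the quadratic loss above."""
    return [c * (t - m) for c, m, t in zip(curv, center, theta)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def restoration_direction(g_harm, g_task, h_diag, damping=0.1):
    """Solve: maximize g_harm.d - 0.5 * d^T H d  subject to  g_task.d = 0,
    with H = diag(h_diag) + damping * I (assumed curvature estimate)."""
    h = [x + damping for x in h_diag]
    pre_harm = [g / x for g, x in zip(g_harm, h)]   # H^{-1} g_harm
    pre_task = [g / x for g, x in zip(g_task, h)]   # H^{-1} g_task
    denom = dot(g_task, pre_task)
    # If the task gradient vanishes there is nothing to project out.
    lam = dot(g_task, pre_harm) / denom if denom > 1e-12 else 0.0
    return [a - lam * b for a, b in zip(pre_harm, pre_task)]

# Hypothetical 3-parameter "model" after fine-tuning.
theta = [1.2, 0.9, 1.0]
task_curv, task_center = [4.0, 1.0, 0.25], [1.0, 1.0, 1.0]
harm_curv, harm_center = [1.0, 1.0, 1.0], [1.0, 1.0, 1.2]

g_task = grad(task_curv, task_center, theta)
g_harm = grad(harm_curv, harm_center, theta)
d = restoration_direction(g_harm, g_task, task_curv)

step = 0.5
theta_new = [t + step * di for t, di in zip(theta, d)]

harm_before = loss(harm_curv, harm_center, theta)
harm_after = loss(harm_curv, harm_center, theta_new)
task_before = loss(task_curv, task_center, theta)
task_after = loss(task_curv, task_center, theta_new)

print("harmful loss:", harm_before, "->", harm_after)
print("task loss:   ", task_before, "->", task_after)
```

In this toy setup most of the update lands on the third coordinate, where the task curvature is smallest, so the harmful loss rises much more than the task loss does. Scaling this to a real model would need a tractable curvature estimate, which this sketch does not attempt.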

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Domain Adaptation and Few-Shot Learning