Curvature-Aware Safety Restoration In LLMs Fine-Tuning
Thong Bach, Thanh Nguyen-Tang, Dung Nguyen, Thao Minh Le, Truyen Tran

TL;DR
This paper introduces a curvature-aware method that restores safety alignment in fine-tuned LLMs by exploiting loss landscape geometry, reducing harmful outputs without sacrificing task performance.
Contribution
It shows that fine-tuned LLMs preserve the geometric structure of their loss landscapes on harmful content, and it proposes a curvature-aware alignment restoration technique based on influence functions and second-order optimization.
Findings
Reduces harmful responses across multiple models and settings.
Maintains or improves task performance and few-shot learning.
Efficiently balances safety and utility in LLM fine-tuning.
Abstract
Fine-tuning Large Language Models (LLMs) for downstream tasks often compromises safety alignment, even when using parameter-efficient methods like LoRA. In this work, we uncover a notable property: fine-tuned models preserve the geometric structure of their loss landscapes concerning harmful content, regardless of the fine-tuning method employed. This suggests that safety behaviors are not erased but shifted to less influential regions of the parameter space. Building on this insight, we propose a curvature-aware alignment restoration method that leverages influence functions and second-order optimization to selectively increase loss on harmful inputs while preserving task performance. By navigating the shared geometry between base and fine-tuned models, our method discourages unsafe outputs while preserving task-relevant performance, avoiding full reversion and enabling precise,…
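The idea of raising the loss on harmful inputs while sparing task performance can be illustrated with a toy example. The sketch below is not the authors' algorithm. It assumes a two-parameter model sitting at a task optimum, a diagonal approximation of the task-loss Hessian, and a fixed harmful-loss gradient. Every name and number in it (`task_hess_diag`, `harmful_grad`, `damping`, `target_gain`) is hypothetical. It compares a plain gradient-ascent step on the harmful loss with a step preconditioned by the inverse damped task curvature, both rescaled to give the same first-order increase in harmful loss.

```python
# Toy sketch only (NOT the paper's method): plain vs. curvature-preconditioned
# ascent on a "harmful" loss, compared at equal first-order harmful-loss gain.

def precondition(grad, hess_diag, damping=1e-3):
    """Return (H + damping*I)^-1 @ grad for a diagonal Hessian approximation."""
    return [g / (h + damping) for g, h in zip(grad, hess_diag)]

def scale_to_gain(direction, grad, target_gain):
    """Rescale `direction` so that grad . step equals `target_gain`."""
    gain = sum(d * g for d, g in zip(direction, grad))
    return [d * target_gain / gain for d in direction]

def task_loss_increase(step, hess_diag):
    """Second-order task-loss increase 0.5 * step^T H step around a task optimum."""
    return 0.5 * sum(h * s * s for s, h in zip(step, hess_diag))

# Hypothetical setup: the task loss is sharp along parameter 0, flat along parameter 1.
task_hess_diag = [100.0, 1.0]
# Hypothetical gradient of the harmful-input loss (we want to move *up* this loss).
harmful_grad = [1.0, 1.0]
target_gain = 1.0

# Plain gradient ascent: step along the raw harmful-loss gradient.
plain_step = scale_to_gain(harmful_grad, harmful_grad, target_gain)
# Curvature-aware ascent: step along the gradient preconditioned by task curvature.
curv_step = scale_to_gain(
    precondition(harmful_grad, task_hess_diag), harmful_grad, target_gain
)

plain_cost = task_loss_increase(plain_step, task_hess_diag)
curv_cost = task_loss_increase(curv_step, task_hess_diag)

print("plain step:", plain_step, "task-loss increase:", plain_cost)
print("curvature-aware step:", curv_step, "task-loss increase:", curv_cost)
```

In this toy setting the preconditioned step moves mostly along the direction where the task loss is flat, so it achieves the same harmful-loss increase at a much smaller task-loss cost. A real model would need a scalable curvature estimate in place of the explicit diagonal Hessian assumed here.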
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Domain Adaptation and Few-Shot Learning
