Safeguarding LLM Fine-tuning via Push-Pull Distributional Alignment
Haozhong Wang, Zhuo Li, Yibo Yang, He Zhao, Hongyuan Zha, Dandan Guo

TL;DR
This paper introduces Safety Optimal Transport (SOT), a distribution-level alignment method that improves LLM safety during fine-tuning by learning sample weights that pull the downstream data distribution toward a trusted safe reference and push it away from a harmful reference, improving safety while keeping downstream performance competitive.
Contribution
The paper proposes a novel distribution-level alignment framework using optimal transport to enhance LLM safety during fine-tuning, moving beyond heuristic instance filtering.
Findings
SOT significantly improves model safety across diverse domains.
SOT maintains competitive downstream performance.
SOT achieves a superior safety-utility trade-off.
Abstract
The inherent safety alignment of Large Language Models (LLMs) is prone to erosion during fine-tuning, even when using seemingly innocuous datasets. While existing defenses attempt to mitigate this via data selection, they typically rely on heuristic, instance-level assessments that neglect the global geometry of the data distribution and fail to explicitly repel harmful patterns. To address this, we introduce Safety Optimal Transport (SOT), a novel framework that reframes safe fine-tuning from an instance-level filtering challenge to a distribution-level alignment task grounded in Optimal Transport (OT). At its core is a dual-reference "push-pull" weight-learning mechanism: SOT optimizes sample importance by actively pulling the downstream distribution towards a trusted safe anchor while simultaneously pushing it away from a general harmful reference. This establishes a robust…
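The abstract is truncated here and does not state the SOT objective, so the following is only a minimal sketch of what a push-pull weight-learning step could look like, not the authors' method. It assumes samples are represented as embedding vectors, uses entropic OT with a squared Euclidean cost, and updates the weights by mirror descent on the simplex, using the Sinkhorn dual potential as the gradient of the OT cost with respect to the sample weights. The function names, the trade-off parameter `lam`, and the toy 2-D data are all hypothetical.

```python
import numpy as np

def _logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = a.max(axis=axis, keepdims=True)
    s = np.exp(a - m).sum(axis=axis, keepdims=True)
    return np.squeeze(m + np.log(s), axis=axis)

def source_potential(w, X, Y, eps=0.5, iters=200):
    """Sinkhorn dual potential f for the weighted source (w, X) against a
    uniformly weighted reference Y. Up to an additive constant, f is the
    gradient of the entropic OT cost with respect to the weights w."""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)  # squared Euclidean cost
    log_w = np.log(np.maximum(w, 1e-300))[:, None]
    log_b = np.full((1, len(Y)), -np.log(len(Y)))
    g = np.zeros(len(Y))
    for _ in range(iters):
        f = -eps * _logsumexp((g[None, :] - C) / eps + log_b, axis=1)
        g = -eps * _logsumexp((f[:, None] - C) / eps + log_w, axis=0)
    return f

def push_pull_weights(X, safe, harm, lam=0.5, lr=0.05, steps=20):
    """Hypothetical push-pull objective: OT(w, safe) - lam * OT(w, harm),
    minimized over the simplex with exponentiated-gradient updates."""
    w = np.full(len(X), 1.0 / len(X))
    for _ in range(steps):
        grad = source_potential(w, X, safe) - lam * source_potential(w, X, harm)
        grad -= grad.mean()            # constants do not matter on the simplex
        w = w * np.exp(-lr * grad)     # pull toward safe, push away from harm
        w /= w.sum()
    return w

# Toy data (assumed): the first 20 downstream points resemble the safe anchor,
# the last 20 resemble the harmful reference.
rng = np.random.default_rng(0)
safe = rng.normal(0.0, 0.3, (30, 2))
harm = rng.normal(3.0, 0.3, (30, 2))
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(3.0, 0.3, (20, 2))])

weights = push_pull_weights(X, safe, harm)
print("mass on safe-like samples:", weights[:20].sum())
print("mass on harm-like samples:", weights[20:].sum())
```

In a fine-tuning loop, weights like these would typically scale each sample's loss. How SOT actually uses the learned weights is not stated in the visible part of the abstract.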
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
