Safeguarding LLM Fine-tuning via Push-Pull Distributional Alignment

Haozhong Wang; Zhuo Li; Yibo Yang; He Zhao; Hongyuan Zha; Dandan Guo

arXiv:2601.07200 · cs.LG · January 13, 2026

TL;DR

This paper introduces Safety Optimal Transport (SOT), a distribution-level alignment method for safe LLM fine-tuning. SOT learns sample weights that pull the downstream data distribution toward a trusted safe anchor and push it away from a harmful reference, improving safety while keeping downstream performance competitive.

Contribution

The paper proposes a novel distribution-level alignment framework using optimal transport to enhance LLM safety during fine-tuning, moving beyond heuristic instance filtering.

Findings

01. SOT significantly improves model safety across diverse domains.

02. SOT maintains competitive downstream performance.

03. SOT achieves a superior safety-utility trade-off.

Abstract

The inherent safety alignment of Large Language Models (LLMs) is prone to erosion during fine-tuning, even when using seemingly innocuous datasets. While existing defenses attempt to mitigate this via data selection, they typically rely on heuristic, instance-level assessments that neglect the global geometry of the data distribution and fail to explicitly repel harmful patterns. To address this, we introduce Safety Optimal Transport (SOT), a novel framework that reframes safe fine-tuning from an instance-level filtering challenge to a distribution-level alignment task grounded in Optimal Transport (OT). At its core is a dual-reference "push-pull" weight-learning mechanism: SOT optimizes sample importance by actively pulling the downstream distribution towards a trusted safe anchor while simultaneously pushing it away from a general harmful reference. This establishes a robust…
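The push-pull idea described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation; the abstract does not give the objective, so everything below is an assumption made for illustration: samples are represented by 2-D embeddings, the ground cost is squared Euclidean distance, the OT cost is entropically regularized and solved with log-domain Sinkhorn, the objective is OT(w, safe) - lam * OT(w, harm), and the weights are updated by mirror descent on the simplex using the source-side dual potential as the gradient signal. All function names and parameters (`learn_weights`, `lam`, `lr`, `eps`) are hypothetical.

```python
import numpy as np

def sq_dists(X, Y):
    """Pairwise squared Euclidean distances (assumed ground cost)."""
    return ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)

def logsumexp(M, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = M.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(M - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def source_potential(a, b, C, eps=1.0, n_iter=100):
    """Log-domain Sinkhorn; returns the dual potential f on the source side.

    For entropic OT, f acts as the gradient of the OT cost with respect to
    the source weights a, up to an additive constant.
    """
    log_a, log_b = np.log(a), np.log(b)
    f = np.zeros(len(a))
    for _ in range(n_iter):
        g = -eps * logsumexp(log_a[:, None] + (f[:, None] - C) / eps, axis=0)
        f = -eps * logsumexp(log_b[None, :] + (g[None, :] - C) / eps, axis=1)
    return f

def learn_weights(X, safe, harm, lam=0.5, lr=0.01, steps=30):
    """Hypothetical push-pull reweighting of downstream samples X.

    Assumed objective: OT(w, safe) - lam * OT(w, harm), minimized over
    weights w on the probability simplex by mirror descent.
    """
    w = np.full(len(X), 1.0 / len(X))             # start from uniform weights
    b_safe = np.full(len(safe), 1.0 / len(safe))  # uniform safe anchor
    b_harm = np.full(len(harm), 1.0 / len(harm))  # uniform harmful reference
    C_safe, C_harm = sq_dists(X, safe), sq_dists(X, harm)
    for _ in range(steps):
        pull = source_potential(w, b_safe, C_safe)  # high for samples far from the safe anchor
        push = source_potential(w, b_harm, C_harm)  # high for samples far from the harmful set
        grad = pull - lam * push
        w = w * np.exp(-lr * (grad - grad.mean()))  # multiplicative (mirror descent) update
        w = np.maximum(w, 1e-12)                    # floor to keep log(w) finite
        w /= w.sum()
    return w

# Toy data: the first 5 downstream samples sit near the safe anchor,
# the last 5 sit near the harmful reference.
rng = np.random.default_rng(0)
safe = rng.normal(0.0, 0.3, size=(8, 2))
harm = rng.normal(4.0, 0.3, size=(8, 2))
X = np.vstack([rng.normal(0.0, 0.3, size=(5, 2)),
               rng.normal(4.0, 0.3, size=(5, 2))])
w = learn_weights(X, safe, harm)
```

In this toy setup the learned weights shift mass toward the samples near the safe anchor and away from those near the harmful reference. The resulting `w` could then be used as per-sample loss weights during fine-tuning, though how SOT itself consumes the weights is not specified in the truncated abstract.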

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)