Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler

Zixuan Hu; Li Shen; Zhenyi Wang; Yongxian Wei; Dacheng Tao

arXiv:2510.27172 · cs.LG · November 3, 2025

Open Access

TL;DR

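This paper introduces Bayesian Data Scheduler (BDS), an adaptive tuning-stage defense against harmful fine-tuning of large language models. BDS needs no attack simulation: it uses Bayesian inference to learn the posterior distribution of each data point's safety attribute, conditioned on the fine-tuning and alignment datasets. The authors report state-of-the-art robustness across diverse attack scenarios.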

Contribution

Proposes a novel Bayesian inference-based data scheduling method for adaptive defense against harmful fine-tuning in large language models, eliminating the need for attack simulation.

Findings

01. Outperforms existing defenses across diverse attack scenarios

02. Effectively adapts to specific fine-tuning datasets

03. Achieves state-of-the-art robustness in experiments

Abstract

Harmful fine-tuning poses critical safety risks to fine-tuning-as-a-service for large language models. Existing defense strategies preemptively build robustness via attack simulation but suffer from fundamental limitations: (i) the infeasibility of extending attack simulations beyond bounded threat models due to the inherent difficulty of anticipating unknown attacks, and (ii) limited adaptability to varying attack settings, as simulation fails to capture their variability and complexity. To address these challenges, we propose Bayesian Data Scheduler (BDS), an adaptive tuning-stage defense strategy with no need for attack simulation. BDS formulates harmful fine-tuning defense as a Bayesian inference problem, learning the posterior distribution of each data point's safety attribute, conditioned on the fine-tuning and alignment datasets. The fine-tuning process is then constrained by…
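To make the phrase "posterior distribution of each data point's safety attribute" concrete, here is a minimal toy sketch of Bayes' rule for a binary safe/harmful attribute, followed by a posterior-weighted loss. It is not the BDS algorithm: the abstract above is cut off before it describes how the posterior is used, so the function names (posterior_safe, weighted_loss), the two-class likelihoods, and the idea of using the posterior as a per-example loss weight are all illustrative assumptions.

```python
from typing import Sequence


def posterior_safe(prior_safe: float, lik_safe: float, lik_harmful: float) -> float:
    """Bayes' rule for a binary latent attribute (safe vs. harmful).

    P(safe | x) = P(x | safe) P(safe)
                  / [P(x | safe) P(safe) + P(x | harmful) P(harmful)]

    Hypothetical helper: the likelihood values are supplied by the caller and
    are not derived from any model described in the paper.
    """
    if not 0.0 <= prior_safe <= 1.0:
        raise ValueError("prior_safe must lie in [0, 1]")
    evidence = lik_safe * prior_safe + lik_harmful * (1.0 - prior_safe)
    if evidence <= 0.0:
        raise ValueError("evidence must be positive")
    return lik_safe * prior_safe / evidence


def weighted_loss(losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-example losses.

    Assumption for illustration only: each weight is that example's posterior
    probability of being safe, so likely-harmful examples contribute less.
    """
    if len(losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must sum to a positive value")
    return sum(l * w for l, w in zip(losses, weights)) / total


if __name__ == "__main__":
    # Two toy examples: one that looks safe, one that looks harmful.
    w_safe = posterior_safe(prior_safe=0.5, lik_safe=0.8, lik_harmful=0.2)
    w_harm = posterior_safe(prior_safe=0.5, lik_safe=0.2, lik_harmful=0.8)
    print(w_safe, w_harm)
    print(weighted_loss([1.0, 3.0], [w_safe, w_harm]))
```

The sketch only shows the general shape of inferring a per-example safety probability and letting it modulate training. How BDS obtains and applies its posterior is specified in the full paper, not here.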

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Information and Cyber Security