Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler
Zixuan Hu, Li Shen, Zhenyi Wang, Yongxian Wei, Dacheng Tao

TL;DR
This paper introduces the Bayesian Data Scheduler (BDS), an adaptive tuning-stage defense for large language models that mitigates harmful fine-tuning without attack simulation. BDS uses Bayesian inference to learn each data point's safety attribute, and the authors report state-of-the-art results.
Contribution
Proposes a novel Bayesian inference-based data scheduling method for adaptive defense against harmful fine-tuning in large language models, eliminating the need for attack simulation.
Findings
Outperforms existing defenses across diverse attack scenarios
Effectively adapts to specific fine-tuning datasets
Achieves state-of-the-art robustness in experiments
Abstract
Harmful fine-tuning poses critical safety risks to fine-tuning-as-a-service for large language models. Existing defense strategies preemptively build robustness via attack simulation but suffer from fundamental limitations: (i) the infeasibility of extending attack simulations beyond bounded threat models due to the inherent difficulty of anticipating unknown attacks, and (ii) limited adaptability to varying attack settings, as simulation fails to capture their variability and complexity. To address these challenges, we propose Bayesian Data Scheduler (BDS), an adaptive tuning-stage defense strategy with no need for attack simulation. BDS formulates harmful fine-tuning defense as a Bayesian inference problem, learning the posterior distribution of each data point's safety attribute, conditioned on the fine-tuning and alignment datasets. The fine-tuning process is then constrained by…
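To make the phrase "posterior distribution of each data point's safety attribute" concrete, here is a toy sketch of Bayes' rule for a single binary safe/harmful attribute. It is not the BDS algorithm. The function name `safety_posterior`, the uniform prior, and the likelihood values are all hypothetical and chosen only for illustration. The sketch does not cover how the fine-tuning process is constrained, which the abstract above does not show.

```python
def safety_posterior(prior_safe, lik_safe, lik_harmful):
    """Return P(safe | evidence) for one data point via Bayes' rule.

    prior_safe  : prior probability that the point is safe (assumed value)
    lik_safe    : P(evidence | safe), hypothetical
    lik_harmful : P(evidence | harmful), hypothetical
    """
    numerator = prior_safe * lik_safe
    evidence = numerator + (1.0 - prior_safe) * lik_harmful
    if evidence == 0.0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return numerator / evidence


# Hypothetical (lik_safe, lik_harmful) pairs for two made-up data points.
points = {
    "benign_example": (0.9, 0.2),
    "suspicious_example": (0.1, 0.8),
}

# Uniform prior of 0.5 is an assumption, not taken from the paper.
posteriors = {
    name: safety_posterior(0.5, lik_safe, lik_harmful)
    for name, (lik_safe, lik_harmful) in points.items()
}

for name, p in posteriors.items():
    print(f"{name}: P(safe | evidence) = {p:.3f}")
```

In this toy setting, a point whose evidence is better explained by the "harmful" hypothesis ends up with a low posterior probability of being safe. A defense could in principle use such a probability to decide how much that point should influence training.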
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Information and Cyber Security
