SafeTuneBed: A Toolkit for Benchmarking LLM Safety Alignment in Fine-Tuning

Saad Hossain; Samanvay Vajpayee; Sirisha Rambhatla

arXiv:2506.00676 · cs.LG · June 3, 2025

TL;DR

SafeTuneBed is a comprehensive benchmark and toolkit that standardizes evaluation of safety, utility, and robustness in fine-tuning large language models, facilitating fair comparisons across diverse methods and defenses.

Contribution

The paper introduces a unified platform for benchmarking LLM safety alignment that integrates datasets, defenses, and evaluation metrics for consistent, reproducible assessment.

Findings

01. Benchmarks various defenses across multiple tasks

02. Demonstrates variability in safety and utility outcomes

03. Provides a standardized framework for future research

Abstract

As large language models (LLMs) become ubiquitous, parameter-efficient fine-tuning methods and safety-first defenses have proliferated rapidly. However, the number of approaches and their rapid growth have led to inconsistent evaluations (varied datasets, differing metrics, and mismatched threat settings), making it difficult to fairly compare safety, utility, and robustness across methods. To address this, we introduce SafeTuneBed, a benchmark and toolkit unifying fine-tuning and defense evaluation. SafeTuneBed (i) curates a diverse repository of fine-tuning datasets spanning sentiment analysis, question-answering, multi-step reasoning, and open-ended instruction tasks, and allows for the generation of harmful-variant splits; (ii) enables integration of state-of-the-art defenses, including alignment-stage immunization, in-training safeguards, and post-tuning repair; and (iii) provides…

Taxonomy

Topics: Safety Systems Engineering in Autonomy · Risk and Safety Analysis · Digital Rights Management and Security