SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection

Han Shen; Pin-Yu Chen; Payel Das; Tianyi Chen

arXiv:2410.07471·cs.LG·October 14, 2024

Open Access · 1 Repo · 3 Reviews

TL;DR

SEAL is a framework that improves the safety of LLM fine-tuning by selectively prioritizing safe and high-quality data through bilevel optimization, leading to better model safety and performance.

Contribution

Introduces a bilevel optimization-based data ranker to enhance safety during LLM fine-tuning, outperforming baseline data selection methods.
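A bilevel data-selection problem of this kind is commonly written in the form below. This is a generic sketch rather than the paper's exact formulation: the symbols (ranker parameters $\omega$, model parameters $\theta$, per-sample selection weights $\sigma_i(\omega)$, per-sample loss $\ell$, fine-tuning samples $z_i$, and the safe-data loss $L_{\mathrm{safe}}$) are assumed here for illustration.

```latex
\begin{aligned}
\min_{\omega}\quad & L_{\mathrm{safe}}\bigl(\theta^{*}(\omega)\bigr) \\
\text{s.t.}\quad & \theta^{*}(\omega) \in \arg\min_{\theta}\; \sum_{i=1}^{N} \sigma_i(\omega)\,\ell(\theta; z_i)
\end{aligned}
```

The lower level fits the model on the reweighted fine-tuning data; the upper level adjusts the weights so that the resulting model does well on safe data. Samples whose weights end up large are up-ranked, and the rest are down-ranked.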

Findings

01

Models trained with SEAL show increased safety and quality.

02

Achieves 8.5% and 9.7% win rate increases over random selection on Llama-3-8b-Instruct and Merlinite-7b, respectively.

03

Demonstrates effectiveness on Llama-3-8b-Instruct and Merlinite-7b models.

Abstract

Fine-tuning on task-specific data to boost downstream performance is a crucial step for leveraging Large Language Models (LLMs). However, previous studies have demonstrated that fine-tuning on several adversarial samples, or even on benign data, can greatly compromise a model's pre-equipped alignment and safety capabilities. In this work, we propose SEAL, a novel framework to enhance safety in LLM fine-tuning. SEAL learns a data ranker based on bilevel optimization to up-rank safe and high-quality fine-tuning data and down-rank unsafe or low-quality data. Models trained with SEAL demonstrate superior quality over multiple baselines, with 8.5% and 9.7% win rate increases compared to random selection on Llama-3-8b-Instruct and Merlinite-7b, respectively. Our code is available on GitHub: https://github.com/hanshen95/SEAL.
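The idea of learning a data ranker through bilevel optimization can be illustrated with a deliberately tiny toy, which is not the authors' implementation (their code is in the linked repository). Everything in it is an assumption made for illustration: the "model" is a single scalar `theta`, each fine-tuning sample is a scalar target, the lower level is a softmax-weighted squared loss with a closed-form minimizer, and the upper level is a squared loss against one hypothetical safe target.

```python
import math


def softmax(scores):
    """Turn raw ranker scores into selection weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def inner_solution(scores, targets):
    """Lower level: argmin_theta sum_i sigma_i * (theta - y_i)^2.

    For this toy loss the minimizer is the softmax-weighted mean of the targets.
    """
    weights = softmax(scores)
    return sum(w * y for w, y in zip(weights, targets))


def safe_loss(theta, safe_target):
    """Upper level: squared distance of the fitted model to the safe target."""
    return (theta - safe_target) ** 2


def hypergradient(scores, targets, safe_target):
    """Exact gradient of the upper-level loss with respect to the ranker scores.

    Chain rule: dL/dw_j = 2 * (theta - y_safe) * sigma_j * (y_j - theta).
    """
    weights = softmax(scores)
    theta = sum(w * y for w, y in zip(weights, targets))
    outer = 2.0 * (theta - safe_target)
    return [outer * w * (y - theta) for w, y in zip(weights, targets)]


def train_ranker(targets, safe_target, steps=500, lr=0.5):
    """Gradient descent on the ranker scores, starting from uniform weights."""
    scores = [0.0] * len(targets)
    for _ in range(steps):
        grad = hypergradient(scores, targets, safe_target)
        scores = [s - lr * g for s, g in zip(scores, grad)]
    return scores


def select_top_k(scores, k):
    """Return the indices of the k highest-scored samples."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


if __name__ == "__main__":
    # Hypothetical data: the first three targets agree with the safe target,
    # the last two pull the model away from it.
    targets = [1.0, 0.9, 1.1, -1.0, -1.2]
    scores = train_ranker(targets, safe_target=1.0)
    print("selected indices:", sorted(select_top_k(scores, 3)))
```

In a real LLM setting the lower level has no closed form, so the hypergradient must be approximated; the toy only shows how an upper-level safety objective can push selection weights toward samples that keep the fitted model close to safe behavior.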

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

I believe that the proposed fine-tuning methodology for enhancing LLM safety addresses a timely and critical issue in AI, making it a highly suitable topic for ICLR. Moreover, this method has the potential to extend beyond the safety domain, offering a general and versatile approach applicable to various key areas, thereby holding significant value for the broader AI community. The learning and evaluation code is transparently and accessibly provided, and with further refinement post-accepta

Weaknesses

(1) Comment: One of the most significant weaknesses of this study lies in the lack of a fundamental explanation for why SEAL achieves performance improvements beyond using safety-aligned data alone. For instance, if the fine-tuning dataset consists entirely of non-safety-aligned data, it remains unclear why SEAL, when using 𝛾>0, would still mitigate performance degradation in safety compared to a setup with 𝛾=0. Ideally, both should yield comparable safety performance. To address this, the auth

Reviewer 02 · Rating 6 · Confidence 4

Strengths

This paper investigates an important topic: safety alignment in the supervised fine-tuning process of LLMs. The problem is a valid concern and the proposed data selection method is well-motivated. Although data selection is not novel by itself, the proposed method appears to be more straightforward, more computationally efficient, and more effective compared to previous methods. Experiments are conducted on various datasets and models, showing the transferability of the selected data across the models

Weaknesses

One potential drawback of this paper is that it assumes the safe dataset is readily available in the supervised fine-tuning process. With the safe dataset available, a straightforward idea would be to directly optimize the model with a combination of safe data and task data, as formulated in Equation (4). Since this model update is also performed in the data selector training anyway, it is unclear if the data selection is really needed to achieve improved safety. On the other hand, from Algorithm 2 the dat

Reviewer 03 · Rating 5 · Confidence 3

Strengths

1. The paper proposes a security-enhanced fine-tuning framework named SEAL, specifically designed to mitigate the negative impacts on model security alignment during the fine-tuning process. 2. SEAL employs a novel two-layer optimization method to learn a data selector, which effectively ranks fine-tuning data. This process increases the emphasis on safe and high-quality data while reducing the influence of unsafe or low-quality data, thus improving fine-tuning safety. 3. The data selector used

Weaknesses

1. Lack of novelty. This paper presents a method for training data selectors and weighting losses, which is very straightforward. At the same time, the paper spends a lot of time on the optimization algorithm, but the optimization algorithm itself lacks improvement and innovation compared to the referenced papers. Overall, the article's approach lacks novelty, both in the idea itself and in the optimization method. 2. For all experiments, there was no adjustment for the variance of performance g

Code & Models

Repositories

hanshen95/seal
PyTorch · Official
