Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions

Jingxin Xu; Guoshun Nan; Sheng Guan; Sicong Leng; Yilian Liu; Zixiao Wang; Yuyang Ma; Zhili Zhou; Yanzhao Hou; Xiaofeng Tao

arXiv:2502.08657·cs.CL·February 14, 2025

TL;DR

This paper introduces PT-ALIGN, a safety self-alignment method for LLMs that automatically refines positive and toxic samples with minimal human intervention, improving safety without sacrificing helpfulness.

Contribution

The paper presents a novel approach that leverages LLMs to automatically generate and refine safety-related training samples, reducing manual annotation efforts and enhancing safety alignment.

Findings

1. PT-ALIGN effectively improves safety alignment across 9 open-source LLMs.

2. The method maintains helpfulness and usefulness while enhancing safety.

3. Iterative sample refinement requires fewer than 50 human annotations.

Abstract

Recent AI agents, such as ChatGPT and LLaMA, primarily rely on instruction tuning and reinforcement learning to calibrate the output of large language models (LLMs) with human intentions, ensuring the outputs are harmless and helpful. Existing methods heavily depend on the manual annotation of high-quality positive samples, while contending with issues such as noisy labels and minimal distinctions between preferred and dispreferred response data. However, readily available toxic samples with clear safety distinctions are often filtered out, removing valuable negative references that could aid LLMs in safety alignment. In response, we propose PT-ALIGN, a novel safety self-alignment approach that minimizes human supervision by automatically refining positive and toxic samples and performing fine-grained dual instruction tuning. Positive samples are harmless responses, while toxic samples…
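
To make the idea of learning from both positive and toxic samples concrete, here is a minimal sketch of one possible dual objective. The abstract as shown here is truncated and does not specify the loss, so everything below is an assumption for illustration: the sketch pairs a standard likelihood term on positive-response tokens with an unlikelihood-style penalty on toxic-response tokens, and the names `likelihood_loss`, `unlikelihood_loss`, `dual_loss`, and `toxic_weight` are hypothetical, not taken from the paper.

```python
import math
from typing import Sequence

EPS = 1e-12  # guards against log(0)


def likelihood_loss(token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood; small when tokens get high probability."""
    return -sum(math.log(max(p, EPS)) for p in token_probs) / len(token_probs)


def unlikelihood_loss(token_probs: Sequence[float]) -> float:
    """Mean of -log(1 - p); small when tokens get low probability."""
    return -sum(math.log(max(1.0 - p, EPS)) for p in token_probs) / len(token_probs)


def dual_loss(
    positive_probs: Sequence[float],
    toxic_probs: Sequence[float],
    toxic_weight: float = 1.0,  # hypothetical knob, not from the paper
) -> float:
    """Assumed combined objective: reward positive tokens, penalize toxic ones."""
    return likelihood_loss(positive_probs) + toxic_weight * unlikelihood_loss(toxic_probs)


# Probabilities a model assigns to the tokens of a harmless and a toxic response.
safe_model = dual_loss([0.9, 0.8], [0.1, 0.2])
unsafe_model = dual_loss([0.9, 0.8], [0.7, 0.9])
print(safe_model < unsafe_model)
```

In this toy setup, a model that assigns high probability to toxic tokens incurs a larger loss than one that does not, even when both treat the positive response identically. That is the sense in which a toxic sample can act as a negative reference.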

Taxonomy

Topics: Safety Systems Engineering in Autonomy · Manufacturing Process and Optimization · Occupational Health and Safety Research

Methods: LLaMA