SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming

Anurakt Kumar; Divyanshu Kumar; Jatan Loya; Nitin Aravind Birur; Tanay Baswa; Sahil Agarwal; Prashanth Harshangi

arXiv:2408.11851·cs.AI·August 23, 2024



TL;DR

SAGE-RT is a novel pipeline that generates diverse, nuanced synthetic data for safety evaluation and red-teaming of large language models, covering a wide range of harmful topics and effectively identifying vulnerabilities.

Contribution

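The paper introduces SAGE-RT, a taxonomy-based method for synthetic data generation. It targets three limitations of existing approaches: datasets that lack nuance and diversity, limited control over data generation and validation, and reliance on large amounts of manually generated seed data.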

Findings

1. Generated 51,000 prompt-response pairs covering more than 1,500 topics of harmfulness.

2. Jailbreaks state-of-the-art LLMs in more than 27 out of 32 sub-categories.

3. Achieves a 100% attack success rate on GPT-4o and GPT-3.5-turbo for harmfulness categories.

Abstract

We introduce Synthetic Alignment data Generation for Safety Evaluation and Red Teaming (SAGE-RT or SAGE), a novel pipeline for generating synthetic alignment and red-teaming data. Existing methods fall short in creating nuanced and diverse datasets or in providing the necessary control over the data generation and validation processes, or they require large amounts of manually generated seed data. SAGE addresses these limitations by using a detailed taxonomy to produce safety-alignment and red-teaming data across a wide range of topics. We generated 51,000 diverse and in-depth prompt-response pairs, encompassing over 1,500 topics of harmfulness and covering variations of the most frequent types of jailbreaking prompts faced by large language models (LLMs). We show that the red-teaming data generated through SAGE jailbreaks state-of-the-art LLMs in more than 27 out of 32 sub-categories, and in more than…
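As a rough illustration of how per-sub-category results like those in the abstract could be tallied, the sketch below computes an attack success rate for each sub-category and counts how many sub-categories were jailbroken at least once. The record format, the function names, the toy data, and the "at least one successful attack" criterion are all assumptions made for illustration; this page does not describe the paper's actual evaluation procedure.

```python
from collections import defaultdict


def attack_success_by_subcategory(records):
    """Return {sub_category: fraction of prompts judged successful jailbreaks}.

    `records` is an iterable of (sub_category, jailbroken) pairs. This format
    is hypothetical and not taken from the paper.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for sub_category, jailbroken in records:
        totals[sub_category] += 1
        if jailbroken:
            successes[sub_category] += 1
    return {name: successes[name] / totals[name] for name in totals}


def count_jailbroken_subcategories(rates):
    """Count sub-categories with at least one successful attack (assumed criterion)."""
    return sum(1 for rate in rates.values() if rate > 0)


# Toy, made-up records: (sub_category, was_the_model_jailbroken)
records = [
    ("sub_category_a", True),
    ("sub_category_a", True),
    ("sub_category_b", True),
    ("sub_category_b", False),
    ("sub_category_c", False),
]

rates = attack_success_by_subcategory(records)
jailbroken = count_jailbroken_subcategories(rates)
print(rates)
print(f"{jailbroken} of {len(rates)} sub-categories jailbroken")
```

Under this assumed criterion, a sub-category counts as jailbroken as soon as a single prompt succeeds, while a 100% attack success rate requires every prompt in the sub-category to succeed.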


Code & Models

Models

enkryptai/DeepSeek-R1-Distill-Llama-8B-Enkrypt-Aligned


Taxonomy

Topics: Software Reliability and Analysis Research

Methods: Attention Is All You Need · Adam · Layer Normalization · Weight Decay · Dense Connections · Attention Dropout · Cosine Annealing