SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming
Anurakt Kumar, Divyanshu Kumar, Jatan Loya, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi

TL;DR
SAGE-RT is a novel pipeline that generates diverse, nuanced synthetic data for safety evaluation and red-teaming of large language models, covering a wide range of harmful topics and effectively identifying vulnerabilities.
Contribution
The paper introduces SAGE-RT, a detailed taxonomy-based method for synthetic data generation that overcomes limitations of existing approaches, enabling comprehensive safety testing.
Findings
Generated 51,000 prompt-response pairs covering 1,500+ topics.
Jailbreaks state-of-the-art LLMs in more than 27 out of 32 sub-categories.
Achieves 100% attack success rate on GPT-4o and GPT-3.5-turbo for harmfulness categories.
Abstract
We introduce Synthetic Alignment data Generation for Safety Evaluation and Red Teaming (SAGE-RT or SAGE), a novel pipeline for generating synthetic alignment and red-teaming data. Existing methods fall short in creating nuanced and diverse datasets, do not provide the necessary control over the data generation and validation processes, or require large amounts of manually generated seed data. SAGE addresses these limitations by using a detailed taxonomy to produce safety-alignment and red-teaming data across a wide range of topics. We generated 51,000 diverse and in-depth prompt-response pairs, encompassing over 1,500 topics of harmfulness and covering variations of the most frequent types of jailbreaking prompts faced by large language models (LLMs). We show that the red-teaming data generated through SAGE jailbreaks state-of-the-art LLMs in more than 27 out of 32 sub-categories, and in more than…
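The per-sub-category result quoted in the abstract can be illustrated with a small tally. This is a minimal sketch rather than the authors' evaluation code: the record layout, the function names `attack_success_rates` and `count_jailbroken`, the placeholder category labels, and the rule that a sub-category counts as jailbroken once its attack success rate exceeds a threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def attack_success_rates(records):
    """Return {sub_category: fraction of prompts judged successful attacks}.

    `records` is an iterable of (sub_category, jailbroken) pairs, where
    `jailbroken` is a bool produced by some external judge (hypothetical).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for sub_category, jailbroken in records:
        totals[sub_category] += 1
        if jailbroken:
            hits[sub_category] += 1
    return {name: hits[name] / totals[name] for name in totals}


def count_jailbroken(rates, threshold=0.0):
    """Count sub-categories whose success rate is strictly above `threshold`.

    Assumption: the source does not state the exact criterion it uses.
    """
    return sum(1 for rate in rates.values() if rate > threshold)


# Toy records with placeholder labels, not data from the paper.
records = [
    ("category_a", True),
    ("category_a", False),
    ("category_b", False),
    ("category_c", True),
]
rates = attack_success_rates(records)
jailbroken_count = count_jailbroken(rates)
```

With real evaluation output, `records` would hold one entry per red-teaming prompt, and `jailbroken_count` would be compared against the total number of sub-categories in the taxonomy.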
Taxonomy
Topics: Software Reliability and Analysis Research
Methods: Attention Is All You Need · Adam · Layer Normalization · Weight Decay · Dense Connections · Attention Dropout · Cosine Annealing
