Lightweight Safety Guardrails via Synthetic Data and RL-guided Adversarial Training

Aleksei Ilin; Gor Matevosyan; Xueying Ma; Vladimir Eremin; Suhaa Dada; Muqun Li; Riyaaz Shaik; Haluk Noyan Tokgozoglu

arXiv:2507.08284·cs.LG·July 14, 2025

TL;DR

This paper presents a lightweight safety guardrail framework for language models that uses synthetic data and RL-guided adversarial training to improve content moderation, outperforming larger models in efficiency and robustness.

Contribution

The paper introduces a novel approach combining synthetic data generation and adversarial training with reinforcement learning to enhance small language models for safety tasks.

Findings

1. Small models achieve comparable or better safety performance than larger models.

2. Synthetic data and adversarial training improve robustness against harmful content.

3. Framework reduces computational costs while maintaining high effectiveness.

Abstract

We introduce a lightweight yet highly effective safety guardrail framework for language models, demonstrating that small-scale language models can achieve, and even surpass, the performance of larger counterparts in content moderation tasks. This is accomplished through high-fidelity synthetic data generation and adversarial training. The synthetic data generation process begins with human-curated seed data, which undergoes query augmentation and paraphrasing to create diverse and contextually rich examples. This augmented data is then subjected to multiple rounds of curation, ensuring high fidelity and relevance. Inspired by recent advances in the Generative Adversarial Network (GAN) architecture, our adversarial training employs reinforcement learning to guide a generator that produces challenging synthetic examples. These examples are used to fine-tune the safety classifier,…
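The abstract (truncated above) does not specify the generator architecture, the reward, or the RL algorithm, so the following is only a toy sketch of the general idea: a generator policy is rewarded when its output slips past the safety classifier, and the examples that slip past are then used to update the classifier. Everything here is an assumption for illustration: the keyword-based `KeywordClassifier`, the three rewriting strategies, the REINFORCE-style bandit update, and the memorization-based `fine_tune` are hypothetical stand-ins, not the paper's method.

```python
import math
import random

# Hypothetical seed prompts that the toy classifier should flag as unsafe.
SEED_PROMPTS = ["plan an attack", "describe the attack"]

# Hypothetical rewriting strategies the generator can pick from.
def identity(text):
    return text

def leetspeak(text):
    return text.translate(str.maketrans("aeio", "4310"))

def dotted(text):
    return " ".join(".".join(word) for word in text.split())

STRATEGIES = [identity, leetspeak, dotted]

class KeywordClassifier:
    """Toy stand-in for a safety classifier: blocklist tokens plus memorized strings."""
    def __init__(self, blocklist):
        self.blocklist = set(blocklist)
        self.memorized = set()

    def is_unsafe(self, text):
        return text in self.memorized or any(t in self.blocklist for t in text.split())

    def fine_tune(self, hard_examples):
        # Crude substitute for fine-tuning: remember the examples that were missed.
        self.memorized.update(hard_examples)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def adversarial_rounds(classifier, seeds, rounds=200, lr=0.5, rng=None):
    """Train a softmax policy over strategies with a REINFORCE-style update.

    Reward is 1.0 when the rewritten unsafe prompt is NOT flagged by the classifier.
    Returns the final strategy probabilities and the collected hard examples.
    """
    rng = rng or random.Random(0)
    logits = [0.0] * len(STRATEGIES)
    baseline = 0.0  # running average of rewards, used to reduce variance
    hard_examples = []
    for _ in range(rounds):
        probs = softmax(logits)
        action = rng.choices(range(len(STRATEGIES)), weights=probs)[0]
        candidate = STRATEGIES[action](rng.choice(seeds))
        reward = 0.0 if classifier.is_unsafe(candidate) else 1.0
        advantage = reward - baseline
        for k in range(len(logits)):
            # Gradient of log pi(action) with respect to logit k.
            grad = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad
        baseline += 0.1 * (reward - baseline)
        if reward == 1.0:
            hard_examples.append(candidate)
    return softmax(logits), hard_examples

if __name__ == "__main__":
    clf = KeywordClassifier({"attack"})
    probs, hard = adversarial_rounds(clf, SEED_PROMPTS)
    print("strategy probabilities:", [round(p, 3) for p in probs])
    clf.fine_tune(hard)  # the classifier now flags the examples it missed
    print("still missed:", [x for x in set(hard) if not clf.is_unsafe(x)])
```

In this sketch the classifier is updated only once, after the generator has been trained. Alternating the two updates over several rounds would be closer to the GAN-inspired setup the abstract describes, though the abstract does not give the actual schedule.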

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling