Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

Yuan Fang; Yiming Luo; Aimin Zhou; Fei Tan

arXiv:2604.17769 · cs.CL · April 21, 2026

TL;DR

This paper introduces R-CAI, a framework for automated, controllable toxic data generation for LLM safety testing, using a critique-revision pipeline and probability clamping to improve coherence and adversarial effectiveness.

Contribution

R-CAI inverts a harmless constitution into a constitution of toxicity and applies probability clamping within RLAIF to stabilize adversarial data generation, all without human annotation.

Findings

01. R-CAI generates diverse, high-quality toxic data.

02. Probability clamping improves semantic coherence by 15%.

03. R-CAI enables scalable, automated safety evaluation.

Abstract

Ensuring the safety of large language models (LLMs) requires robust red teaming, yet the systematic synthesis of high-quality toxic data remains under-explored. We propose Reverse Constitutional AI (R-CAI), a framework for automated and controllable adversarial data generation that moves beyond isolated jailbreak prompts. By inverting a harmless constitution into a constitution of toxicity and iteratively refining model outputs through a critique-revision pipeline, R-CAI enables scalable synthesis of multi-dimensional adversarial data without human annotation. Optimizing solely for toxicity-related rewards, however, can lead to reward hacking and degraded semantic coherence. To address this challenge, we introduce probability clamping within reinforcement learning from AI feedback, which stabilizes adversarial optimization while preserving adversarial intent. Experiments demonstrate…
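
The abstract does not say which probability is clamped or what bounds are used, so the following is only a minimal sketch of one possible reading, not the authors' method. It assumes a hypothetical judge model returns a probability p that a sample satisfies the reward criterion, and that p is bounded to an interval [lo, hi] before being turned into a log-odds reward. The function names and the bounds 0.05 and 0.95 are illustrative assumptions. Under this reading, the reward saturates once the judge is confident, so pushing p toward 0 or 1 yields no further gain.

```python
import math


def clamp_probability(p: float, lo: float = 0.05, hi: float = 0.95) -> float:
    """Bound a probability to [lo, hi].

    The bounds are illustrative assumptions, not values from the paper.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    return min(max(p, lo), hi)


def clamped_log_odds_reward(p: float, lo: float = 0.05, hi: float = 0.95) -> float:
    """Map a (hypothetical) judge probability to a bounded log-odds reward.

    Without clamping, log(p / (1 - p)) diverges as p approaches 0 or 1;
    with clamping, the reward stays inside a fixed finite range.
    """
    q = clamp_probability(p, lo, hi)
    return math.log(q / (1.0 - q))


if __name__ == "__main__":
    for p in (0.0, 0.5, 0.9, 0.999, 1.0):
        print(p, clamped_log_odds_reward(p))
```

In this sketch, inputs of 0.999 and 1.0 receive the same reward as 0.95, which is the saturation behaviour that a clamp is meant to provide.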
