TL;DR
BARRED is a novel framework that generates high-quality synthetic training data for custom policy guardrails using debate and domain decomposition, reducing reliance on human labels.
Contribution
It introduces a debate-based, domain decomposition approach to create faithful synthetic data for fine-tuning models to enforce custom policies.
Findings
Small language models fine-tuned on BARRED's synthetic data outperform proprietary LLMs as custom-policy guardrails.
Debate and dimension decomposition are essential for data diversity and fidelity.
BARRED reduces the need for extensive human annotation in training custom classifiers.
Abstract
Deploying guardrails for custom policies remains challenging, as generic safety models fail to capture task-specific requirements, while prompting LLMs suffers from inconsistent boundary-case performance and high inference costs. Training custom classifiers achieves both accuracy and efficiency, yet demands substantial labeled data that is costly to obtain. We present BARRED (Boundary Alignment Refinement through REflection and Debate), a framework for generating faithful and diverse synthetic training data using only a task description and a small set of unlabeled examples. Our approach decomposes the domain space into dimensions to ensure comprehensive coverage, and employs multi-agent debate to verify label correctness, yielding a high-fidelity training corpus. Experiments across diverse custom policies demonstrate that small language models finetuned on our synthetic data…
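The two mechanisms named in the abstract, dimension decomposition and debate-based label verification, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the example policy, the dimension names, the rule-based judge functions (stand-ins for LLM agents), and the single-round unanimity vote (a simplification of a multi-turn debate). The summary above does not specify the paper's prompts, agent roles, or refinement loop, so this is not the authors' implementation.

```python
from collections import Counter
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Cell = Dict[str, str]
Judge = Callable[[Cell], str]

# Hypothetical dimensions for an illustrative "no medical advice" policy.
DIMENSIONS: Dict[str, List[str]] = {
    "topic": ["medication dosage", "general wellness"],
    "framing": ["direct question", "hypothetical scenario"],
}


def enumerate_cells(dimensions: Dict[str, List[str]]) -> List[Cell]:
    """Return every combination of dimension values (the coverage grid)."""
    keys = list(dimensions)
    return [dict(zip(keys, combo)) for combo in product(*(dimensions[k] for k in keys))]


# Rule-based stand-ins for LLM debaters; a real system would call models here.
def strict_judge(cell: Cell) -> str:
    return "violation" if cell["topic"] == "medication dosage" else "allowed"


def lenient_judge(cell: Cell) -> str:
    is_direct_dosage = (
        cell["topic"] == "medication dosage" and cell["framing"] == "direct question"
    )
    return "violation" if is_direct_dosage else "allowed"


JUDGES: List[Judge] = [strict_judge, lenient_judge, strict_judge]


def debate_label(cell: Cell, judges: List[Judge], min_agreement: float = 1.0) -> Optional[str]:
    """Return the majority label if enough judges agree, otherwise None."""
    votes = Counter(judge(cell) for judge in judges)
    label, count = votes.most_common(1)[0]
    return label if count / len(judges) >= min_agreement else None


def build_corpus(dimensions: Dict[str, List[str]], judges: List[Judge]) -> List[Tuple[Cell, str]]:
    """Keep only the cells whose label survives verification."""
    corpus = []
    for cell in enumerate_cells(dimensions):
        label = debate_label(cell, judges)
        if label is not None:
            corpus.append((cell, label))
    return corpus


if __name__ == "__main__":
    for cell, label in build_corpus(DIMENSIONS, JUDGES):
        print(label, cell)
```

In this toy setup the grid has four cells, and the one on which the judges disagree (a dosage question in hypothetical framing) is dropped rather than given a contested label. Lowering `min_agreement` would keep such boundary cases at the cost of label fidelity, which is the trade-off a verification step is meant to manage.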
