CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety
Umid Suleymanov, Rufiz Bayramov, Suad Gafarli, Seljan Musayeva, Taghi Mammadov, Aynur Akhundlu, Murat Kantarcioglu

TL;DR
CourtGuard introduces a retrieval-augmented, debate-based framework for LLM safety that enables zero-shot policy adaptation and automated auditing without retraining, outperforming traditional static classifiers.
Contribution
We propose a novel, model-agnostic framework that redefines safety evaluation as evidentiary debate, allowing zero-shot policy adaptation and automated dataset curation in LLM safety.
Findings
Achieved state-of-the-art performance on 7 safety benchmarks.
Demonstrated 90% accuracy on an out-of-domain Wikipedia Vandalism task.
Enabled automated curation of nine adversarial attack datasets.
Abstract
Current safety mechanisms for Large Language Models (LLMs) rely heavily on static, fine-tuned classifiers that suffer from adaptation rigidity, the inability to enforce new governance rules without expensive retraining. To address this, we introduce CourtGuard, a retrieval-augmented multi-agent framework that reimagines safety evaluation as Evidentiary Debate. By orchestrating an adversarial debate grounded in external policy documents, CourtGuard achieves state-of-the-art performance across 7 safety benchmarks, outperforming dedicated policy-following baselines without fine-tuning. Beyond standard metrics, we highlight two critical capabilities: (1) Zero-Shot Adaptability, where our framework successfully generalized to an out-of-domain Wikipedia Vandalism task (achieving 90% accuracy) by swapping the reference policy; and (2) Automated Data Curation and Auditing, where we leveraged…
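The idea of grounding a debate in a swappable policy document can be illustrated with a toy example. The sketch below is not the authors' implementation: every name in it (Clause, retrieve, prosecutor, defender, judge, and both example policies) is hypothetical, the retriever is plain word overlap, and the "agents" are rule-based stand-ins for what would presumably be LLM calls. It only shows the shape of the loop: retrieve policy clauses, let two sides cite them, and decide, with the policy passed in as data so it can be replaced without retraining anything.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    """One line of a policy document (hypothetical structure)."""
    clause_id: str
    text: str
    prohibits: bool  # True if the clause forbids the described content


def tokens(text: str) -> set[str]:
    """Lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(policy: list[Clause], content: str, k: int = 2) -> list[Clause]:
    """Toy retriever: keep up to k clauses sharing at least one word with the content."""
    query = tokens(content)
    scored = [(len(query & tokens(c.text)), c) for c in policy]
    scored = [(score, c) for score, c in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def prosecutor(evidence: list[Clause]) -> list[str]:
    """Stand-in for an LLM agent arguing for a violation: cites prohibiting clauses."""
    return [c.clause_id for c in evidence if c.prohibits]


def defender(evidence: list[Clause]) -> list[str]:
    """Stand-in for an LLM agent arguing for compliance: cites permitting clauses."""
    return [c.clause_id for c in evidence if not c.prohibits]


def judge(content: str, policy: list[Clause]) -> dict:
    """Run one round of the toy debate and return a verdict with cited clauses."""
    evidence = retrieve(policy, content)
    against = prosecutor(evidence)
    in_favor = defender(evidence)
    verdict = "violation" if len(against) > len(in_favor) else "compliant"
    return {"verdict": verdict, "against": against, "in_favor": in_favor}


# Two illustrative policies (invented for this sketch, not taken from the paper).
safety_policy = [
    Clause("S1", "instructions for building weapons or explosives", True),
    Clause("S2", "educational discussion of chemistry history", False),
]
vandalism_policy = [
    Clause("V1", "inserting insults profanity or nonsense into an article", True),
    Clause("V2", "correcting spelling grammar or citations in an article", False),
]

edit = "replaced the lead with nonsense and insults"
print(judge(edit, vandalism_policy))  # cites V1 under the vandalism policy
print(judge(edit, safety_policy))     # no clause retrieved under the safety policy
```

In this toy setup the same edit is flagged under the vandalism policy and passes under the safety policy, because only the policy argument changes between the two calls. A real system would replace the word-overlap retriever and the rule-based agents with much stronger components.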
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Topic Modeling
