CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety

Umid Suleymanov; Rufiz Bayramov; Suad Gafarli; Seljan Musayeva; Taghi Mammadov; Aynur Akhundlu; Murat Kantarcioglu

arXiv:2602.22557 · cs.AI · February 27, 2026


Open Access

TL;DR

CourtGuard introduces a retrieval-augmented, debate-based framework for LLM safety that enables zero-shot policy adaptation and automated auditing without retraining, outperforming traditional static classifiers.

Contribution

We propose a novel, model-agnostic framework that redefines safety evaluation as evidentiary debate, allowing zero-shot policy adaptation and automated dataset curation in LLM safety.

Findings

01. Achieved state-of-the-art performance on 7 safety benchmarks.

02. Demonstrated 90% accuracy on an out-of-domain Wikipedia Vandalism task.

03. Enabled automated curation of nine adversarial attack datasets.

Abstract

Current safety mechanisms for Large Language Models (LLMs) rely heavily on static, fine-tuned classifiers that suffer from adaptation rigidity: the inability to enforce new governance rules without expensive retraining. To address this, we introduce CourtGuard, a retrieval-augmented multi-agent framework that reimagines safety evaluation as Evidentiary Debate. By orchestrating an adversarial debate grounded in external policy documents, CourtGuard achieves state-of-the-art performance across 7 safety benchmarks, outperforming dedicated policy-following baselines without fine-tuning. Beyond standard metrics, we highlight two critical capabilities: (1) Zero-Shot Adaptability, where our framework successfully generalized to an out-of-domain Wikipedia Vandalism task (achieving 90% accuracy) by swapping the reference policy; and (2) Automated Data Curation and Auditing, where we leveraged…
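The general shape described in the abstract (retrieve policy evidence, run an adversarial debate over it, then adapt by swapping the reference policy) can be illustrated with a small sketch. This is not the authors' implementation: every name below (Clause, retrieve, evaluate, the stub agents, both example policies) is hypothetical, the keyword retriever stands in for whatever retrieval the paper uses, and the three agents are deterministic stubs where a real system would presumably call an LLM.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical illustration only: none of these names come from the paper.

@dataclass(frozen=True)
class Clause:
    clause_id: str
    text: str
    keywords: Tuple[str, ...]  # assumed toy index; a real retriever would differ

@dataclass
class Verdict:
    label: str             # "VIOLATION" or "COMPLIANT"
    cited: List[str]       # ids of the retrieved policy clauses
    transcript: List[str]  # debate turns, kept for auditing

def tokens(text: str) -> set:
    """Lowercase word set of a string."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(policy: List[Clause], content: str, k: int = 2) -> List[Clause]:
    """Toy retriever: keep up to k clauses whose keywords appear in the content."""
    words = tokens(content)
    scored = [(sum(kw in words for kw in c.keywords), c) for c in policy]
    hits = [(score, c) for score, c in scored if score > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep policy order
    return [c for _, c in hits[:k]]

# Stub agents. Each receives the content, the retrieved evidence, and the transcript so far.
def stub_prosecutor(content, evidence, transcript):
    cited = ", ".join(c.clause_id for c in evidence) or "no clause"
    return f"The content may violate: {cited}."

def stub_defender(content, evidence, transcript):
    return "The content is compliant unless a cited clause clearly applies."

def stub_judge(content, evidence, transcript):
    # Assumption: the stub sides with the prosecution whenever evidence was retrieved.
    return "VIOLATION" if evidence else "COMPLIANT"

def evaluate(content, policy, prosecutor=stub_prosecutor,
             defender=stub_defender, judge=stub_judge, rounds=1) -> Verdict:
    """Retrieve evidence from the policy, run the debate, and return the judge's verdict."""
    evidence = retrieve(policy, content)
    transcript: List[str] = []
    for _ in range(rounds):
        transcript.append("PROSECUTION: " + prosecutor(content, evidence, transcript))
        transcript.append("DEFENSE: " + defender(content, evidence, transcript))
    label = judge(content, evidence, transcript)
    return Verdict(label, [c.clause_id for c in evidence], transcript)

# Two made-up reference policies; swapping them changes the task with no retraining.
SAFETY_POLICY = [
    Clause("S1", "Do not provide instructions for making weapons.", ("weapon", "explosive")),
    Clause("S2", "Do not reveal personal data.", ("password", "address")),
]
VANDALISM_POLICY = [
    Clause("V1", "Edits must not insert insults.", ("stupid", "idiot")),
    Clause("V2", "Edits must not blank existing sections.", ("blanked", "removed")),
]
```

In this sketch the only thing that changes between the safety task and the vandalism task is the `policy` argument passed to `evaluate`, which is the sense in which the adaptation is zero-shot here. The returned transcript and cited clause ids are kept so that a verdict can be audited after the fact.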


Taxonomy

Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Topic Modeling