Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
Yue Huang, Haomin Zhuang, Jiayi Ye, Han Bao, Yanbo Wang, Hang Hua, Siyuan Wu, Pin-Yu Chen, Xiangliang Zhang

TL;DR
This paper introduces Guardian-as-an-Advisor, a soft-gating system that improves safety and robustness of large language models by providing risk assessments and advice without over-refusal.
Contribution
It presents a soft-gating pipeline, Guardian-as-an-Advisor (GaaA), and GuardSet, a 208k+ multi-domain dataset for training and evaluating guardian models on safety, robustness, and honesty.
Findings
GuardAdvisor attains competitive detection accuracy while also supporting the advisory workflow.
Augmented prompts with Guardian advice improve model responses.
Inference overhead remains below 10%, ensuring efficiency.
Abstract
Hard-gated safety checkers often over-refuse and misalign with a vendor's model spec; prevailing taxonomies also neglect robustness and honesty, yielding systems that are safer on paper yet less useful. This work introduces Guardian-as-an-Advisor (GaaA), a soft-gating pipeline in which a guardian predicts a binary risk label plus a concise explanation and prepends this advice to the original query for re-inference, keeping the base model operating under its original spec. To support training and evaluation, the authors construct GuardSet, a 208k+ multi-domain dataset unifying harmful and harmless cases with targeted robustness and honesty slices. GuardAdvisor is trained via SFT followed by RL to enforce label-explanation consistency. GuardAdvisor attains competitive detection accuracy while enabling the advisory workflow; when used to augment inputs, responses improve over unaugmented prompts. A latency…
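The soft-gating flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Advice` type, the advice template in `augment_prompt`, and the toy guardian and base model are all hypothetical stand-ins, and the sketch assumes advice is prepended to every query, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Advice:
    risky: bool        # binary risk label predicted by the guardian
    explanation: str   # concise explanation accompanying the label


def augment_prompt(query: str, advice: Advice) -> str:
    """Prepend the guardian's advice to the original query.

    The template is an assumption; the paper's actual format is not
    given in the abstract.
    """
    label = "risky" if advice.risky else "safe"
    header = f"[Guardian advice] label: {label}; reason: {advice.explanation}"
    return f"{header}\n\n{query}"


def soft_gate(
    query: str,
    guardian: Callable[[str], Advice],
    base_model: Callable[[str], str],
) -> str:
    """Soft gating: the guardian advises, the base model still answers.

    A hard gate would return a refusal whenever the guardian flags the
    query; here the decision stays with the base model and its own spec.
    """
    advice = guardian(query)
    return base_model(augment_prompt(query, advice))


# Toy stand-ins so the sketch runs end to end (hypothetical behavior).
def toy_guardian(query: str) -> Advice:
    flagged = "bypass" in query.lower()
    reason = "possible misuse request" if flagged else "no risk detected"
    return Advice(risky=flagged, explanation=reason)


def toy_base_model(prompt: str) -> str:
    return f"<response to: {prompt!r}>"


if __name__ == "__main__":
    print(soft_gate("How do I bypass a login page?", toy_guardian, toy_base_model))
```

In a real deployment `guardian` and `base_model` would wrap model calls; the point of the sketch is only that the guardian's output changes the prompt rather than blocking the request.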
