A Lightweight Explainable Guardrail for Prompt Safety
Md Asiful Islam, Mihai Surdeanu

TL;DR
This paper introduces LEG, a lightweight, explainable guardrail that detects unsafe prompts using multi-task learning over synthetic explanation data, matching or exceeding state-of-the-art performance with a considerably smaller model.
Contribution
LEG is a small model that jointly learns prompt safety classification and word-level explanation labeling. It is trained on synthetic explanation data generated to counteract the confirmation biases of LLMs, using a new loss function.
Findings
LEG matches or exceeds state-of-the-art accuracy in prompt safety detection.
LEG provides explainability by labeling prompt words that influence safety decisions.
LEG maintains high performance both in-domain and out-of-domain across multiple datasets.
Abstract
We propose a lightweight explainable guardrail (LEG) method to detect unsafe prompts. LEG uses a multi-task learning architecture to jointly learn a prompt classifier and an explanation classifier, where the latter labels the prompt words that explain the overall safe/unsafe decision. LEG is trained on synthetic explanation data, which is generated using a novel strategy that counteracts the confirmation biases of LLMs. Lastly, LEG's training process uses a novel loss that captures global explanation signals as weak supervision and combines cross-entropy and focal losses with uncertainty-based weighting. LEG obtains equivalent or better performance than the state of the art for both prompt classification and explainability, both in-domain and out-of-domain on three datasets, despite being considerably smaller than current approaches.
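The abstract names the ingredients of the training loss (cross-entropy, focal loss, uncertainty-based weighting) but not their exact form. The sketch below is one plausible reading, not the authors' implementation. It assumes cross-entropy on the prompt-level label, focal loss averaged over the word-level explanation labels, and the common learned log-variance weighting `exp(-s) * L + s` per task. All function names, the assignment of losses to tasks, and the `gamma` default are hypothetical, and the global explanation signal used as weak supervision is not modeled here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])


def focal_loss(logits, target, gamma=2.0):
    """Cross-entropy scaled by (1 - p_t)^gamma, which down-weights easy examples."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def uncertainty_weighted(losses, log_vars):
    """Combine task losses as sum_i exp(-s_i) * L_i + s_i.

    Each s_i is a per-task log-variance that would be a learnable
    parameter in a real training loop (assumption, not from the source).
    """
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


def joint_loss(prompt_logits, prompt_label, word_logits, word_labels,
               log_vars=(0.0, 0.0), gamma=2.0):
    """Hypothetical two-task loss: prompt classification plus word labeling."""
    # Task 1 (assumed): cross-entropy on the safe/unsafe prompt label.
    prompt_term = cross_entropy(prompt_logits, prompt_label)
    # Task 2 (assumed): focal loss averaged over per-word explanation labels.
    word_term = sum(
        focal_loss(logits, label, gamma)
        for logits, label in zip(word_logits, word_labels)
    ) / len(word_labels)
    return uncertainty_weighted([prompt_term, word_term], log_vars)


# Toy example: one prompt with three words, binary labels at both levels.
example = joint_loss(
    prompt_logits=[0.2, 1.5],
    prompt_label=1,
    word_logits=[[2.0, -1.0], [0.1, 0.3], [-0.5, 1.2]],
    word_labels=[0, 1, 1],
)
```

With both log-variances at zero, the combined loss reduces to the plain sum of the two task losses. Raising a task's log-variance shrinks that task's contribution while adding a penalty that keeps the weights from collapsing to zero.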
Code & Models
- 🤗 clulab/LEG-1.0-aegis2.0-basemodel · 7 dl
- 🤗 clulab/LEG-1.0-aegis2.0-largemodel · 5 dl
- 🤗 clulab/LEG-1.0-aegis2.0-xsmodel · 5 dl
- 🤗 clulab/LEG-1.0-toxicchat0124-basemodel · 7 dl
- 🤗 clulab/LEG-1.0-toxicchat0124-largemodel · 4 dl
- 🤗 clulab/LEG-1.0-toxicchat0124-xsmodel · 5 dl
- 🤗 clulab/LEG-1.0-wildguardmix-basemodel · 8 dl
- 🤗 clulab/LEG-1.0-wildguardmix-largemodel · 5 dl
- 🤗 clulab/LEG-1.0-wildguardmix-xsmodel · 5 dl
