A Lightweight Explainable Guardrail for Prompt Safety

Md Asiful Islam; Mihai Surdeanu

arXiv:2602.15853·cs.CL·April 28, 2026

TL;DR

This paper introduces LEG, a lightweight, explainable guardrail that detects unsafe prompts using multi-task learning on synthetic explanation data, matching or exceeding state-of-the-art performance with a considerably smaller model.

Contribution

LEG is a small model that jointly learns prompt safety classification and word-level explanation labeling. It is trained on synthetic explanation data generated to counteract the confirmation biases of LLMs, using a new loss function.

Findings

1. LEG matches or exceeds state-of-the-art accuracy in prompt safety detection.
2. LEG provides explainability by labeling prompt words that influence safety decisions.
3. LEG maintains high performance both in-domain and out-of-domain across multiple datasets.

Abstract

We propose a lightweight explainable guardrail (LEG) method to detect unsafe prompts. LEG uses a multi-task learning architecture to jointly learn a prompt classifier and an explanation classifier, where the latter labels the prompt words that explain the overall safe/unsafe decision. LEG is trained on synthetic explanation data, which is generated using a novel strategy that counteracts the confirmation biases of LLMs. Lastly, LEG's training process uses a novel loss that captures global explanation signals as weak supervision and combines cross-entropy and focal losses with uncertainty-based weighting. LEG matches or outperforms the state of the art in both prompt classification and explainability, both in-domain and out-of-domain on three datasets, despite being considerably smaller than current approaches.
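
To make the loss description more concrete, here is a minimal pure-Python sketch of how a cross-entropy term and a focal term might be combined with uncertainty-based weighting. It is not the paper's formulation. The weighting form (each task loss scaled by exp(-s) plus a regularizer s, with s a learnable log-variance), the assignment of cross-entropy to the prompt classifier and focal loss to the word-level explanation classifier, and all function names are assumptions made for illustration. The global explanation signal used as weak supervision is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Standard cross-entropy for a single example."""
    return -math.log(softmax(logits)[target])


def focal_loss(logits, target, gamma=2.0):
    """Focal loss: cross-entropy down-weighted on confidently correct examples."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def uncertainty_weighted(losses, log_vars):
    """Assumed weighting: sum_i exp(-s_i) * L_i + s_i, with s_i a log-variance."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


def joint_loss(prompt_logits, prompt_label, token_logits, token_labels,
               log_vars=(0.0, 0.0), gamma=2.0):
    """Hypothetical joint objective for one prompt.

    Assumption: cross-entropy on the prompt-level label, mean focal loss
    on the per-word explanation labels.
    """
    prompt_term = cross_entropy(prompt_logits, prompt_label)
    token_term = sum(
        focal_loss(logits, label, gamma)
        for logits, label in zip(token_logits, token_labels)
    ) / len(token_labels)
    return uncertainty_weighted([prompt_term, token_term], log_vars)


# Toy example: a prompt labeled unsafe (1) with three words, the last one flagged.
loss = joint_loss(
    prompt_logits=[0.2, 1.5],
    prompt_label=1,
    token_logits=[[2.0, -1.0], [1.0, 0.0], [-0.5, 1.5]],
    token_labels=[0, 0, 1],
)
print(f"joint loss: {loss:.4f}")
```

In a real training loop the log-variances would be trainable parameters updated together with the model weights, so the balance between the two tasks is learned rather than hand-tuned.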
