SGuard-v1: Safety Guardrail for Large Language Models

JoonHo Lee; HyeonMin Cho; Jaewoong Yun; Hyunjae Lee; JunKyu Lee; Juree Seok

arXiv:2511.12497 · cs.CL · November 18, 2025

TL;DR

SGuard-v1 is a lightweight safety framework for large language models that detects harmful content and adversarial prompts, improving safety and interpretability with minimal deployment overhead.

Contribution

The paper introduces a dual-model safety guardrail for LLMs, trained on approximately 1.4 million collected and synthesized instances, that achieves state-of-the-art safety performance while remaining efficient to deploy.

Findings

01. Achieves state-of-the-art results on safety benchmarks

02. Reduces deployment overhead compared to larger models

03. Provides interpretable safety predictions

Abstract

We present SGuard-v1, a lightweight safety guardrail for Large Language Models (LLMs), which comprises two specialized models to detect harmful content and screen adversarial prompts in human-AI conversational settings. The first component, ContentFilter, is trained to identify safety risks in LLM prompts and responses in accordance with the MLCommons hazard taxonomy, a comprehensive framework for trust and safety assessment of AI. The second component, JailbreakFilter, is trained with a carefully designed curriculum over integrated datasets and findings from prior work on adversarial prompting, covering 60 major attack types while mitigating false-unsafe classification. SGuard-v1 is built on the 2B-parameter Granite-3.3-2B-Instruct model that supports 12 languages. We curate approximately 1.4 million training instances from both collected and synthesized data and perform instruction…
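
The two-component screening described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SGuard-v1 API: the names `Verdict`, `screen_turn`, `toy_content_filter`, and `toy_jailbreak_filter` are hypothetical, the keyword checks merely stand in for the trained models, and the category labels are placeholders rather than entries from the MLCommons hazard taxonomy.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Verdict:
    """Outcome of screening one conversational turn (hypothetical structure)."""
    safe: bool
    source: Optional[str] = None  # which check flagged the turn, if any
    reason: Optional[str] = None  # placeholder category or attack label


# A classifier returns None when nothing is flagged, else a short label.
Classifier = Callable[[str], Optional[str]]


def toy_content_filter(text: str) -> Optional[str]:
    """Keyword stand-in for a content filter; a real one would be a trained model."""
    lowered = text.lower()
    placeholder_rules = (
        ("build a bomb", "placeholder_violence"),
        ("steal a password", "placeholder_cybercrime"),
    )
    for keyword, label in placeholder_rules:
        if keyword in lowered:
            return label
    return None


def toy_jailbreak_filter(prompt: str) -> Optional[str]:
    """Keyword stand-in for a jailbreak filter; a real one would be a trained model."""
    if "ignore all previous instructions" in prompt.lower():
        return "placeholder_instruction_override"
    return None


def screen_turn(
    prompt: str,
    response: Optional[str] = None,
    content_filter: Classifier = toy_content_filter,
    jailbreak_filter: Classifier = toy_jailbreak_filter,
) -> Verdict:
    """Run both filters on a turn and report the first check that flags it."""
    # 1. Screen the prompt for adversarial (jailbreak) patterns.
    reason = jailbreak_filter(prompt)
    if reason is not None:
        return Verdict(False, "jailbreak_filter", reason)
    # 2. Screen the prompt for harmful content.
    reason = content_filter(prompt)
    if reason is not None:
        return Verdict(False, "content_filter:prompt", reason)
    # 3. Screen the model response, when one is available.
    if response is not None:
        reason = content_filter(response)
        if reason is not None:
            return Verdict(False, "content_filter:response", reason)
    return Verdict(True)
```

Passing the classifiers in as parameters keeps the control flow separate from the models, so the keyword stubs could be swapped for real model calls without changing `screen_turn`. The order of the checks here is an arbitrary choice for the sketch; the source does not specify one.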

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI