A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection
Gabriel Chua, Shing Yee Chan, Shaun Khoo

TL;DR
This paper presents a flexible, data-free methodology for developing off-topic prompt guardrails for large language models (LLMs). It uses LLM-generated synthetic datasets to benchmark and train these guardrails, improving safety while reducing false positives.
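The data-free step can be pictured as turning a qualitative description of an application's scope into an LLM request, then collecting labelled (system prompt, user prompt) pairs from the reply. The sketch below is not the authors' released pipeline. It assumes the model is asked to answer in JSON with `on_topic` and `off_topic` lists. The names `Example`, `build_generation_request` and `parse_generation`, the prompt wording, and the JSON shape are all hypothetical. The LLM call itself is replaced by a hard-coded reply.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Example:
    """One labelled benchmark/training pair (hypothetical schema)."""
    system_prompt: str
    user_prompt: str
    is_off_topic: bool


def build_generation_request(system_prompt: str, n_per_label: int) -> str:
    """Build an instruction asking an LLM for on- and off-topic user prompts.

    The wording and the JSON reply format are assumptions for illustration.
    """
    return (
        "You are generating test data for a chatbot.\n"
        f"Its system prompt is:\n{system_prompt}\n\n"
        f"Write {n_per_label} user prompts that fall within this scope and "
        f"{n_per_label} that fall outside it. Reply only with JSON of the form "
        '{"on_topic": [...], "off_topic": [...]}.'
    )


def parse_generation(system_prompt: str, raw_json: str) -> List[Example]:
    """Turn the LLM's JSON reply into labelled (system, user, label) examples."""
    data = json.loads(raw_json)
    examples = [Example(system_prompt, p, False) for p in data["on_topic"]]
    examples += [Example(system_prompt, p, True) for p in data["off_topic"]]
    return examples


if __name__ == "__main__":
    system = "You are a customer-support assistant for a bicycle shop."
    request = build_generation_request(system, n_per_label=2)
    # Hard-coded stand-in for the model's reply; a real pipeline would
    # send `request` to an LLM instead.
    reply = (
        '{"on_topic": ["How do I fix a flat tyre?", "Do you sell helmets?"], '
        '"off_topic": ["Write me a poem about taxes.", "Solve x^2 = 9."]}'
    )
    dataset = parse_generation(system, reply)
    print(len(dataset), sum(e.is_off_topic for e in dataset))  # 4 2
```

Keeping the system prompt inside every example is deliberate. The label only makes sense relative to the scope it was generated for.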
Contribution
It introduces a novel, adaptable guardrail development approach that requires no real-world data, generalizes across misuse categories, and is accompanied by open-source datasets and models for the community.
Findings
Outperforms heuristic guardrails in off-topic detection
Generalizes to jailbreak and harmful prompt detection
Provides open-source datasets and models
Abstract
Large Language Models (LLMs) are prone to off-topic misuse, where users may prompt these models to perform tasks beyond their intended scope. Current guardrails, which often rely on curated examples or custom classifiers, suffer from high false-positive rates, limited adaptability, and the impracticality of requiring real-world data that is not available in pre-production. In this paper, we introduce a flexible, data-free guardrail development methodology that addresses these challenges. By thoroughly defining the problem space qualitatively and passing this to an LLM to generate diverse prompts, we construct a synthetic dataset to benchmark and train off-topic guardrails that outperform heuristic approaches. Additionally, by framing the task as classifying whether the user prompt is relevant with respect to the system prompt, our guardrails effectively generalize to other misuse…
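Framing the guardrail as a classifier over the pair (system prompt, user prompt) means any scorer with that signature can be plugged in and measured on the same synthetic data. The sketch below assumes a hypothetical `Guardrail` type that returns a probability that the user prompt is off-topic. It uses a toy word-overlap cosine score as the heuristic, which is neither the paper's heuristic baseline nor its trained classifier. It also shows how such a heuristic produces false positives: an on-topic question that shares no words with the system prompt gets blocked.

```python
import math
import re
from collections import Counter
from typing import Callable, Iterable, Tuple

# Hypothetical interface: score a (system prompt, user prompt) pair and
# return an estimated probability that the user prompt is off-topic.
Guardrail = Callable[[str, str], float]


def _word_counts(text: str) -> Counter:
    """Lower-cased word counts; a crude stand-in for a learned representation."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_heuristic(system_prompt: str, user_prompt: str) -> float:
    """Toy heuristic guardrail: 1 minus the cosine similarity of word counts."""
    a, b = _word_counts(system_prompt), _word_counts(user_prompt)
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 if norm == 0 else 1.0 - dot / norm


def evaluate(
    guardrail: Guardrail,
    data: Iterable[Tuple[str, str, bool]],
    threshold: float = 0.5,
) -> dict:
    """Score labelled (system_prompt, user_prompt, is_off_topic) triples.

    A false positive is an on-topic prompt that the guardrail blocks.
    """
    tp = fp = tn = fn = 0
    for system_prompt, user_prompt, is_off_topic in data:
        blocked = guardrail(system_prompt, user_prompt) >= threshold
        if blocked and is_off_topic:
            tp += 1
        elif blocked:
            fp += 1
        elif is_off_topic:
            fn += 1
        else:
            tn += 1
    return {
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


if __name__ == "__main__":
    system = "You are a support assistant for a bicycle shop."
    data = [
        (system, "My chain keeps slipping, what should I do?", False),
        (system, "Write a poem about taxes.", True),
    ]
    print(evaluate(cosine_heuristic, data))
    # {'false_positive_rate': 1.0, 'recall': 1.0}
```

A trained pair classifier could replace `cosine_heuristic` without any change to `evaluate`. This separation is what lets one synthetic benchmark compare heuristic and learned guardrails.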
