$R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning
Mintong Kang, Bo Li

TL;DR
The paper introduces $R^2$-Guard, a novel safety guardrail for LLMs that combines data-driven safety assessments with logical reasoning based on safety knowledge, improving robustness and effectiveness.
Contribution
It proposes a knowledge-enhanced logical reasoning framework for LLM safety guardrails, integrating probabilistic graphical models with safety knowledge to address limitations of existing methods.
Findings
$R^2$-Guard outperforms eight strong guardrail models on six benchmarks.
It significantly improves robustness against jailbreaking attacks.
It achieves improvements of 30.2% and 59.5% over SOTA methods in key evaluations.
Abstract
As LLMs become increasingly prevalent across various applications, it is critical to establish safety guardrails to moderate input/output content of LLMs. Existing guardrail models treat various safety categories independently and fail to explicitly capture the intercorrelations among them. This has led to limitations such as ineffectiveness due to inadequate training on long-tail data from correlated safety categories, susceptibility to jailbreaking attacks, and inflexibility regarding new safety categories. To address these limitations, we propose $R^2$-Guard, a robust reasoning enabled LLM guardrail via knowledge-enhanced logical reasoning. Specifically, $R^2$-Guard comprises two parts: data-driven category-specific learning and reasoning components. The data-driven guardrail models provide unsafety probabilities of moderated content on different safety categories. We then encode…
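The two-part design described in the abstract can be illustrated with a toy example. The following is a minimal sketch and not the authors' implementation: it assumes the reasoning component behaves like a small Markov-logic-style model in which each weighted rule `body => head` multiplies the weight of every world that satisfies it. The function name `unsafe_probability`, the category names, and the rule weights are all hypothetical choices made for illustration.

```python
from itertools import product
from math import exp


def unsafe_probability(category_probs, rules):
    """Toy inference over weighted implication rules (illustrative only).

    category_probs: dict mapping a category name to the unsafety probability
        reported by a data-driven guardrail model for that category.
    rules: list of (body, head, weight) tuples, read as "body => head".
        `head` may be another category or the special target "unsafe".
    """
    names = sorted(category_probs)
    variables = names + ["unsafe"]
    unsafe_mass = 0.0
    total_mass = 0.0
    # Enumerate every truth assignment (world) over categories and the target.
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        # Prior factor: how well the world agrees with the category models.
        weight = 1.0
        for name in names:
            p = category_probs[name]
            weight *= p if world[name] else 1.0 - p
        # Rule factor: exp(sum of weights of rules satisfied in this world).
        satisfied = sum(
            w for body, head, w in rules if (not world[body]) or world[head]
        )
        weight *= exp(satisfied)
        total_mass += weight
        if world["unsafe"]:
            unsafe_mass += weight
    return unsafe_mass / total_mass


# Hypothetical safety knowledge: one cross-category rule and two target rules.
RULES = [
    ("self_harm_instructions", "self_harm", 3.0),
    ("self_harm", "unsafe", 5.0),
    ("hate", "unsafe", 5.0),
]

risky = unsafe_probability(
    {"self_harm_instructions": 0.9, "self_harm": 0.4, "hate": 0.05}, RULES
)
benign = unsafe_probability(
    {"self_harm_instructions": 0.05, "self_harm": 0.05, "hate": 0.05}, RULES
)
print(f"risky={risky:.3f} benign={benign:.3f}")
```

Exhaustive enumeration is exponential in the number of variables, so this sketch only suits a handful of categories. It is meant to show how rules can link categories to each other and to the final decision, not how inference would be done at scale.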
Taxonomy
Topics: Logic, Reasoning, and Knowledge
