$R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning

Mintong Kang; Bo Li

arXiv:2407.05557·cs.AI·July 9, 2024·1 cites

Open Access · 1 Repo

TL;DR

The paper introduces $R^2$-Guard, a novel safety guardrail for LLMs that combines data-driven safety assessments with logical reasoning based on safety knowledge, improving robustness and effectiveness.

Contribution

It proposes a knowledge-enhanced logical reasoning framework for LLM safety guardrails, integrating probabilistic graphical models with safety knowledge to address limitations of existing methods.

Findings

01 $R^2$-Guard outperforms eight strong guardrail models on six benchmarks.

02 It significantly improves robustness against jailbreaking attacks.

03 It achieves improvements of 30.2% and 59.5% over SOTA methods in key evaluations.

Abstract

As LLMs become increasingly prevalent across various applications, it is critical to establish safety guardrails to moderate input/output content of LLMs. Existing guardrail models treat various safety categories independently and fail to explicitly capture the intercorrelations among them. This has led to limitations such as ineffectiveness due to inadequate training on long-tail data from correlated safety categories, susceptibility to jailbreaking attacks, and inflexibility regarding new safety categories. To address these limitations, we propose $R^{2}$-Guard, a robust reasoning enabled LLM guardrail via knowledge-enhanced logical reasoning. Specifically, $R^{2}$-Guard comprises two parts: data-driven category-specific learning and reasoning components. The data-driven guardrail models provide unsafety probabilities of moderated content on different safety categories. We then encode…
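As a rough illustration of the two-part design the abstract describes, the sketch below takes per-category unsafety probabilities and combines them with weighted logical rules in the style of a Markov logic network, using brute-force enumeration over all truth assignments. It is not the authors' implementation: the function name `unsafe_probability`, the category names, the rule weights, and the extra `closure_weight` rule are hypothetical choices made only for this example.

```python
from itertools import product
from math import exp


def unsafe_probability(category_probs, rule_weights, closure_weight=3.0):
    """Illustrative MLN-style inference (assumed form, not the paper's code).

    category_probs: {category: P(category is violated)} from data-driven models.
    rule_weights:   {category: weight of the rule "category => unsafe"}.
    closure_weight: weight of a hypothetical rule "unsafe => some category",
                    added so that content with no flagged category scores low.
    """
    names = list(category_probs)
    mass_unsafe = 0.0
    mass_total = 0.0
    # Enumerate every truth assignment of the category variables.
    for bits in product((0, 1), repeat=len(names)):
        # Likelihood of this assignment under the per-category models.
        likelihood = 1.0
        for name, bit in zip(names, bits):
            p = category_probs[name]
            likelihood *= p if bit else 1.0 - p
        for unsafe in (0, 1):
            satisfied = 0.0
            # "category => unsafe" fails only if category=1 and unsafe=0.
            for name, bit in zip(names, bits):
                if not (bit == 1 and unsafe == 0):
                    satisfied += rule_weights[name]
            # "unsafe => some category" fails only if unsafe=1 and no bit is set.
            if not (unsafe == 1 and not any(bits)):
                satisfied += closure_weight
            mass = likelihood * exp(satisfied)
            mass_total += mass
            mass_unsafe += unsafe * mass
    return mass_unsafe / mass_total


if __name__ == "__main__":
    # Hypothetical categories and weights, for demonstration only.
    probs = {"self_harm": 0.9, "violence": 0.1}
    weights = {"self_harm": 2.0, "violence": 2.0}
    print(unsafe_probability(probs, weights))
```

With a single category whose probability is 1, this sketch reduces to the logistic function of that rule's weight, which gives a quick sanity check. Enumeration is exponential in the number of categories, so it is only practical for a handful of them.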

Code & Models

Repositories

kangmintong/r-2-guard (PyTorch, official)

Taxonomy

Topics: Logic, Reasoning, and Knowledge