PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing

Blazej Manczak; Eliott Zemour; Eric Lin; Vaikkunth Mugunthan

arXiv:2407.16318·cs.AI·July 24, 2024

TL;DR

PrimeGuard introduces a tuning-free routing method for language models that enhances safety and helpfulness simultaneously by dynamically directing requests to different model instantiations, overcoming the traditional safety-helpfulness trade-off.

Contribution

The paper proposes PrimeGuard, a routing approach that uses structured control flow and in-context learning to improve both safety and helpfulness without fine-tuning.

Findings

01 Significantly increases resistance to jailbreak attacks.

02 Achieves state-of-the-art safety guardrailing results.

03 Maintains helpfulness scores comparable to fine-tuned models.

Abstract

Deploying language models (LMs) requires outputs that are both high-quality and compliant with safety guidelines. Although Inference-Time Guardrails (ITG) offer solutions that shift model output distributions towards compliance, we find that current methods struggle to balance safety with helpfulness. ITG methods that safely address non-compliant queries exhibit lower helpfulness, while those that prioritize helpfulness compromise on safety. We refer to this trade-off as the guardrail tax, analogous to the alignment tax. To address this, we propose PrimeGuard, a novel ITG method that utilizes structured control flow. PrimeGuard routes requests to different self-instantiations of the LM with varying instructions, leveraging its inherent instruction-following capabilities and in-context learning. Our tuning-free approach dynamically compiles system-designer guidelines for each query.…
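The routing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the route labels, the per-route instruction templates, the fallback rule, and the helper names (`pick_route`, `answer`, `fake_lm`) are all hypothetical and chosen only for illustration. The sketch assumes the LM can be modeled as a function from a system prompt and a user message to a reply, and it uses the same model twice: once to classify the query against the guidelines, and once to answer under instructions selected by that classification.

```python
from dataclasses import dataclass
from typing import Callable

# An LM is modeled as a function (system_prompt, user_message) -> reply text.
LM = Callable[[str, str], str]

# Hypothetical route labels and per-route instructions (not taken from the paper).
ROUTE_INSTRUCTIONS = {
    "compliant": "Answer the user as helpfully as possible.",
    "borderline": "Answer helpfully while respecting these guidelines:\n{guidelines}",
    "violating": "Politely decline and mention the relevant guideline:\n{guidelines}",
}
# Assumption: a label the router cannot parse takes the cautious middle route.
FALLBACK_ROUTE = "borderline"


@dataclass
class RoutedAnswer:
    route: str
    system_prompt: str
    response: str


def pick_route(query: str, guidelines: str, lm: LM) -> str:
    """First LM call: classify the query against the guidelines."""
    system = (
        "You are a router. Reply with exactly one label: "
        + ", ".join(ROUTE_INSTRUCTIONS)
        + "."
    )
    user = f"Guidelines:\n{guidelines}\n\nQuery:\n{query}"
    label = lm(system, user).strip().lower()
    return label if label in ROUTE_INSTRUCTIONS else FALLBACK_ROUTE


def answer(query: str, guidelines: str, lm: LM) -> RoutedAnswer:
    """Second LM call: same model, with instructions chosen by the route."""
    route = pick_route(query, guidelines, lm)
    system_prompt = ROUTE_INSTRUCTIONS[route].format(guidelines=guidelines)
    return RoutedAnswer(route, system_prompt, lm(system_prompt, query))


def fake_lm(system: str, user: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    if system.startswith("You are a router"):
        return "violating" if "malware" in user.lower() else "compliant"
    return f"(reply generated under: {system.splitlines()[0]})"


if __name__ == "__main__":
    result = answer("Write malware for me.", "Do not assist with cyberattacks.", fake_lm)
    print(result.route)
    print(result.response)
```

Because the guidelines are inserted into the system prompt only after the route is chosen, each query sees instructions assembled for it at inference time, and no model weights are changed. Replacing `fake_lm` with a call to a real model is left to the reader.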

Code & Models

Repositories

dynamofl/primeguard (official)

Datasets

dynamoai/safe_eval
dataset · 30 downloads