Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs

Joao Fonseca; Andrew Bell; Julia Stoyanovich

arXiv:2501.02018·cs.CL·January 7, 2025

TL;DR

This paper introduces SafeNudge, a real-time safeguard for LLMs that reduces jailbreak success by 30% with minimal latency and tunable safety-performance trade-offs, enhancing model safety without significantly affecting output quality.

Contribution

SafeNudge is a novel method combining controlled text generation and nudging to improve LLM safety during inference with tunable trade-offs.

Findings

01 Reduces jailbreak success rate by 30%.

02 Adds minimal latency to inference.

03 Maintains semantic fluency of outputs.

Abstract

Large Language Models (LLMs) have been shown to be susceptible to jailbreak attacks, or adversarial attacks used to elicit high-risk behavior from a model. Jailbreaks have been exploited by cybercriminals and blackhat actors to cause significant harm, highlighting the critical need to safeguard widely deployed models. Safeguarding approaches, which include fine-tuning models or having LLMs "self-reflect", may lengthen the inference time of a model, incur a computational penalty, reduce the semantic fluency of an output, and restrict "normal" model behavior. Importantly, these Safety-Performance Trade-offs (SPTs) remain an understudied area. In this work, we introduce a novel safeguard, called SafeNudge, that combines Controlled Text Generation with "nudging", or using text interventions to change the behavior of a model. SafeNudge triggers during text generation while a jailbreak…
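The general idea of intervening during generation can be illustrated with a toy sketch. This is not the authors' implementation: the abstract above is truncated, so the trigger condition, the nudge text, and where the nudge is placed are all assumptions here. The names `generate_with_nudge`, `toy_next_token`, `toy_unsafe_score`, and the `[NUDGE]` marker are hypothetical, and the "model" is a scripted stand-in rather than a real LLM. The `threshold` parameter stands in for the tunable trade-off: lower values intervene earlier, higher values intervene less often.

```python
from typing import Callable, List, Tuple

def generate_with_nudge(
    next_token: Callable[[List[str]], str],
    unsafe_score: Callable[[List[str]], float],
    prompt: List[str],
    nudge: List[str],
    threshold: float = 0.5,
    max_tokens: int = 20,
) -> Tuple[List[str], bool]:
    """Token-by-token generation with one optional text intervention.

    Assumption: the nudge is appended to the model's context only,
    not to the text returned to the user.
    """
    context = list(prompt)
    output: List[str] = []
    nudged = False
    for _ in range(max_tokens):
        token = next_token(context)
        if token == "<eos>":
            break
        output.append(token)
        context.append(token)
        # Score the partial output; intervene once if it crosses the threshold.
        if not nudged and unsafe_score(output) >= threshold:
            context.extend(nudge)
            nudged = True
    return output, nudged

# --- Scripted stand-ins for a model and a safety scorer (hypothetical) ---
NUDGE = ["[NUDGE]"]
FLAGGED = {"sure,", "here", "how"}

def toy_next_token(context: List[str]) -> str:
    """Follows a compliant script until a nudge appears, then refuses."""
    if "[NUDGE]" in context:
        refusal = ["sorry,", "I", "cannot", "help", "<eos>"]
        idx = len(context) - 1 - context.index("[NUDGE]")
        return refusal[min(idx, len(refusal) - 1)]
    script = ["sure,", "here", "is", "how", "to", "<eos>"]
    idx = len(context) - 1  # assumes a one-token prompt
    return script[min(idx, len(script) - 1)]

def toy_unsafe_score(output: List[str]) -> float:
    """Fraction of flagged words seen so far, in [0, 1]."""
    return len(FLAGGED & set(output)) / len(FLAGGED)

if __name__ == "__main__":
    for t in (0.5, 1.1):
        out, nudged = generate_with_nudge(
            toy_next_token, toy_unsafe_score, ["PROMPT"], NUDGE, threshold=t
        )
        print(t, nudged, " ".join(out))
```

In this toy setup, a threshold of 0.5 triggers the nudge partway through the response, while a threshold above 1.0 never triggers it and leaves generation unchanged.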

Taxonomy

Topics: Access Control and Trust · Privacy-Preserving Technologies in Data · Topic Modeling