Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs
Joao Fonseca, Andrew Bell, Julia Stoyanovich

TL;DR
This paper introduces SafeNudge, a real-time safeguard for LLMs that reduces jailbreak success by 30% with minimal latency and tunable safety-performance trade-offs, enhancing model safety without significantly affecting output quality.
Contribution
SafeNudge is a novel method combining controlled text generation and nudging to improve LLM safety during inference with tunable trade-offs.
Findings
Reduces jailbreak success rate by 30%.
Adds minimal latency to inference.
Maintains semantic fluency of outputs.
Abstract
Large Language Models (LLMs) have been shown to be susceptible to jailbreak attacks, or adversarial attacks used to elicit high-risk behavior from a model. Jailbreaks have been exploited by cybercriminals and black-hat actors to cause significant harm, highlighting the critical need to safeguard widely deployed models. Safeguarding approaches, which include fine-tuning models or having LLMs "self-reflect", may lengthen the inference time of a model, incur a computational penalty, reduce the semantic fluency of an output, and restrict "normal" model behavior. Importantly, these Safety-Performance Trade-offs (SPTs) remain an understudied area. In this work, we introduce a novel safeguard, called SafeNudge, that combines Controlled Text Generation with "nudging", or using text interventions to change the behavior of a model. SafeNudge triggers during text-generation while a jailbreak…
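The idea of intervening during generation can be illustrated with a toy sketch. This is not the authors' implementation: the token scripts, `toy_next_token`, `toy_risk_score`, the `NUDGE` text, and the choice to drop the flagged token are all hypothetical stand-ins, and `threshold` is only an assumed way to expose a safety-performance knob. A real system would use an actual LLM decoder and a learned risk classifier.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical nudge text; the wording used by the paper is not given in this excerpt.
NUDGE = "[NUDGE: reconsider whether this response is safe]"

# Toy "model": follows an unsafe script until it sees the nudge, then a safe one.
UNSAFE = ["Sure,", "here", "is", "how", "to", "build", "malware"]
SAFE = ["I", "cannot", "help", "with", "that."]


def toy_next_token(prompt: str, generated: List[str]) -> Optional[str]:
    """Stand-in for an LLM decoding step; returns None when generation ends."""
    if NUDGE in generated:
        idx = len(generated) - generated.index(NUDGE) - 1
        return SAFE[idx] if idx < len(SAFE) else None
    return UNSAFE[len(generated)] if len(generated) < len(UNSAFE) else None


def toy_risk_score(generated: List[str]) -> float:
    """Stand-in for a safety classifier over the partial output (assumption)."""
    return 1.0 if "malware" in generated else 0.0


def generate_with_nudge(
    next_token: Callable[[str, List[str]], Optional[str]],
    risk_score: Callable[[List[str]], float],
    prompt: str,
    threshold: float = 0.5,
    max_tokens: int = 20,
) -> Tuple[List[str], bool]:
    """Decode token by token; insert a nudge the first time risk crosses threshold."""
    generated: List[str] = []
    nudged = False
    for _ in range(max_tokens):
        token = next_token(prompt, generated)
        if token is None:
            break
        # Score the partial output as it would look with the candidate token.
        if not nudged and risk_score(generated + [token]) >= threshold:
            generated.append(NUDGE)  # drop the risky token, steer with the nudge
            nudged = True
            continue
        generated.append(token)
    return generated, nudged


if __name__ == "__main__":
    tokens, nudged = generate_with_nudge(toy_next_token, toy_risk_score, "example prompt")
    print(nudged, " ".join(tokens))
```

In this sketch, a lower `threshold` makes the safeguard fire more readily (safer, but more likely to interrupt benign outputs), while a threshold above the classifier's maximum score disables it entirely.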
Taxonomy
Topics: Access Control and Trust · Privacy-Preserving Technologies in Data · Topic Modeling
