You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense

Wuyuao Mai; Geng Hong; Pei Chen; Xudong Pan; Baojun Liu; Yuan Zhang; Haixin Duan; Min Yang

arXiv:2501.12210·cs.CR·January 22, 2025

Open Access

TL;DR

This paper investigates how jailbreak defenses impact the performance and safety of large language models, revealing that current strategies often compromise utility and that developers tend to prioritize performance over safety.

Contribution

The study introduces USEBench and USEIndex to evaluate safety-performance trade-offs and provides a comprehensive analysis of defense strategies across multiple LLMs.

Findings

01 Mainstream defenses often fail to balance safety and performance.

02 Model fine-tuning offers the best overall safety-performance trade-off.

03 Developers tend to prioritize performance over safety during model iteration.

Abstract

Generative large language models (LLMs) such as LLaMA and ChatGPT have significantly transformed daily life and work by providing advanced insights. However, as jailbreak attacks continue to circumvent built-in safety mechanisms by exploiting carefully crafted scenarios or tokens, the safety risks of LLMs have come into focus. While numerous defense strategies (such as prompt detection, modification, and model fine-tuning) have been proposed to counter these attacks, a critical question arises: do these defenses compromise the utility and usability of LLMs for legitimate users? Existing research predominantly focuses on the effectiveness of defense strategies without thoroughly examining their impact on performance, leaving a gap in understanding the trade-offs between LLM safety and performance. Our research addresses this gap by conducting a comprehensive…

Taxonomy

Topics: Cybercrime and Law Enforcement Studies · Digital Rights Management and Security · Digital and Cyber Forensics