Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing

Jiabao Ji; Bairu Hou; Alexander Robey; George J. Pappas; Hamed Hassani; Yang Zhang; Eric Wong; Shiyu Chang

arXiv:2402.16192·cs.CL·March 1, 2024·2 cites

TL;DR

This paper introduces SEMANTICSMOOTH, a novel defense method that enhances large language models' robustness against semantic jailbreak attacks by aggregating predictions over semantically transformed inputs, achieving state-of-the-art results.

Contribution

We propose SEMANTICSMOOTH, a smoothing-based defense that improves robustness against semantic jailbreak attacks without sacrificing nominal performance.

Findings

01 Achieves state-of-the-art robustness against the GCG, PAIR, and AutoDAN attacks.

02 Maintains strong nominal performance on instruction-following benchmarks.

03 Code will be made publicly available.

Abstract

Aligned large language models (LLMs) are vulnerable to jailbreaking attacks, which bypass the safeguards of targeted LLMs and fool them into generating objectionable content. While initial defenses show promise against token-based threat models, there do not exist defenses that provide robustness against semantic attacks and avoid unfavorable trade-offs between robustness and nominal performance. To meet this need, we propose SEMANTICSMOOTH, a smoothing-based defense that aggregates the predictions of multiple semantically transformed copies of a given input prompt. Experimental results demonstrate that SEMANTICSMOOTH achieves state-of-the-art robustness against GCG, PAIR, and AutoDAN attacks while maintaining strong nominal performance on instruction following benchmarks such as InstructionFollowing and AlpacaEval. The codes will be publicly available at…
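
The aggregation idea in the abstract can be illustrated with a small sketch. The abstract states only that predictions over semantically transformed copies of the prompt are aggregated; the majority vote over refusals, the uniform random choice of transformation, and every name below (`semantic_smooth`, `toy_paraphrase`, `toy_summarize`, `toy_model`, `toy_is_refusal`) are assumptions made for illustration, not the authors' implementation. The toy functions stand in for an LLM-based transformation, the target LLM, and a refusal judge.

```python
import random
from typing import Callable, List, Sequence

Transform = Callable[[str], str]


def semantic_smooth(
    prompt: str,
    transforms: Sequence[Transform],
    model: Callable[[str], str],
    is_refusal: Callable[[str], bool],
    n_copies: int = 5,
    seed: int = 0,
) -> str:
    """Query `model` on transformed copies of `prompt` and aggregate.

    Assumed aggregation rule: majority vote on whether the model refuses,
    then return one response that agrees with the majority.
    """
    if n_copies < 1 or not transforms:
        raise ValueError("need at least one copy and one transformation")
    rng = random.Random(seed)

    # Build the transformed copies and collect one response per copy.
    responses: List[str] = []
    for _ in range(n_copies):
        transform = rng.choice(transforms)
        responses.append(model(transform(prompt)))

    # Majority vote over the refusal judgments (ties count as "no refusal").
    votes = [is_refusal(r) for r in responses]
    majority_refuses = 2 * sum(votes) > len(votes)

    # Return a response consistent with the majority outcome.
    consistent = [r for r, v in zip(responses, votes) if v == majority_refuses]
    return rng.choice(consistent)


# Hypothetical stand-ins, for illustration only.
def toy_paraphrase(prompt: str) -> str:
    return " ".join(prompt.split())  # normalizes whitespace


def toy_summarize(prompt: str) -> str:
    return prompt.split(".")[0].strip()  # keeps the first sentence


def toy_model(prompt: str) -> str:
    if "bomb" in prompt.lower():
        return "Sorry, I cannot help with that."
    return "Sure: " + prompt


def toy_is_refusal(response: str) -> bool:
    return response.startswith("Sorry")
```

In this sketch, a call such as `semantic_smooth("Explain photosynthesis", [toy_paraphrase, toy_summarize], toy_model, toy_is_refusal)` passes a benign prompt through unchanged in meaning, while a prompt that the toy model refuses under every transformation yields a refusal. A real deployment would replace the toy functions with LLM calls.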

Code & Models

Repositories

ucsb-nlp-chang/semanticsmooth (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Digital and Cyber Forensics