Mitigating Adversarial Attacks in LLMs through Defensive Suffix Generation

Minkyoung Kim; Yunha Kim; Hyeram Seo; Heejung Choi; Jiye Han; Gaeun Kee; Soyoung Ko; HyoJe Jung; Byeolhee Kim; Young-Hak Kim; Sanghyun Park; Tae Joon Jun

arXiv:2412.13705·cs.CV·December 19, 2024

Open Access

TL;DR

This paper introduces a gradient-based defensive suffix generation method to improve the robustness of large language models against adversarial attacks, effectively reducing attack success rates and enhancing output quality.

Contribution

It proposes a novel total loss function for generating defensive suffixes that mitigate adversarial influences without extensive retraining of LLMs.

Findings

01 Reduces attack success rate by 11% on average

02 Decreases perplexity from 6.57 to 3.93 on Gemma-7B

03 Improves Truthfulness scores by up to 10%
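The first two findings rest on metrics with standard definitions, which can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, perplexity is taken as the exponential of the mean negative log-likelihood per token, and the rule for judging an attack "successful" is assumed to be supplied by the caller, since this page does not state how the authors judge it.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative natural-log likelihood per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def attack_success_rate(outcomes):
    """Fraction of adversarial prompts judged successful.

    `outcomes` is a sequence of booleans, one per attack attempt; how a
    single attempt is judged is an assumption left to the caller.
    """
    if not outcomes:
        raise ValueError("need at least one attack outcome")
    return sum(1 for success in outcomes if success) / len(outcomes)


# Illustrative numbers only, not results from the paper.
example_ppl = perplexity([math.log(0.5)] * 4)
example_asr = attack_success_rate([True, False, False, True])
```

Lower perplexity indicates that the model assigns higher probability to the evaluated text, so the drop reported for Gemma-7B is read as more fluent output, and a lower attack success rate means fewer adversarial prompts achieve their goal.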

Abstract

Large language models (LLMs) have exhibited outstanding performance in natural language processing tasks. However, these models remain susceptible to adversarial attacks in which slight input perturbations can lead to harmful or misleading outputs. A gradient-based defensive suffix generation algorithm is designed to bolster the robustness of LLMs. By appending carefully optimized defensive suffixes to input prompts, the algorithm mitigates adversarial influences while preserving the models' utility. To enhance adversarial understanding, a novel total loss function ($L_{total}$) that combines a defensive loss ($L_{def}$) and an adversarial loss ($L_{adv}$) generates defensive suffixes more effectively. Experimental evaluations conducted on open-source LLMs such as Gemma-7B, Mistral-7B, Llama2-7B, and Llama2-13B show that the proposed method reduces attack success rates (ASR)…
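The abstract names the two loss terms but does not say how they are combined. As a rough sketch only, the snippet below assumes a weighted sum, $L_{total} = L_{def} + \alpha L_{adv}$, and uses it to rank candidate suffixes. The weight `alpha`, the function names, the candidate labels, and the loss values are all hypothetical, and the paper's actual formulation (including signs and weighting) may differ.

```python
def total_loss(l_def, l_adv, alpha=0.5):
    """Assumed combination: weighted sum of defensive and adversarial loss.

    The abstract does not give the formula; this form and `alpha` are
    illustrative assumptions.
    """
    return l_def + alpha * l_adv


def pick_suffix(candidates, alpha=0.5):
    """Return the candidate suffix with the lowest assumed total loss.

    `candidates` maps a suffix string to a (l_def, l_adv) pair that some
    upstream scoring step is assumed to have produced.
    """
    return min(
        candidates,
        key=lambda suffix: total_loss(*candidates[suffix], alpha=alpha),
    )


# Made-up loss values for three hypothetical candidate suffixes.
candidates = {
    "suffix_a": (0.9, 0.4),
    "suffix_b": (0.6, 0.8),
    "suffix_c": (0.7, 0.2),
}
best = pick_suffix(candidates)
```

With `alpha=0` the ranking depends on the defensive loss alone, which shows how the adversarial term can change which suffix is chosen. A gradient-based method would go further and use gradients of the loss with respect to the suffix tokens to propose candidates, a step this sketch leaves out.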

Taxonomy

Topics: Handwritten Text Recognition Techniques · Algorithms and Data Compression · Natural Language Processing Techniques