SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models

Muxi Diao; Rumei Li; Shiyang Liu; Guogang Liao; Jingang Wang; Xunliang Cai; Weiran Xu

arXiv:2408.02632·cs.CL·December 24, 2024

TL;DR

SEAS is an iterative framework that improves large language model security by refining a Red Team model and a Target model on adversarial prompts the models generate themselves, which improves robustness and reduces reliance on manual testing.

Contribution

Introduces a self-evolving adversarial safety framework that iteratively refines attack and defense models to improve LLM security and safety.

Findings

01

Target model security level comparable to GPT-4 after iterations

02

Red Team attack success rate significantly increased

03

Framework reduces manual testing efforts

Abstract

As large language models (LLMs) continue to advance in capability and influence, ensuring their security and preventing harmful outputs has become crucial. A promising approach to address these concerns involves training models to automatically generate adversarial prompts for red teaming. However, the evolving subtlety of vulnerabilities in LLMs challenges the effectiveness of current adversarial methods, which struggle to specifically target and explore the weaknesses of these models. To tackle these challenges, we introduce the Self-Evolving Adversarial Safety (SEAS) optimization framework, which enhances security by leveraging data generated by the model itself. SEAS operates through three iterative stages: Initialization, Attack, and Adversarial Optimization, refining both the Red Team and Target models to…
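
The three stages named in the abstract can be pictured as a simple control loop. The following is only a structural sketch, not the authors' implementation: the function `run_iterations`, the `AttackRecord` type, and the three callbacks (`initialize`, `attack`, `optimize`) are hypothetical names introduced here, and the sketch assumes that some safety judge labels each response as safe or unsafe so that an attack success rate can be computed per iteration.

```python
from typing import Callable, List, NamedTuple, Tuple, TypeVar

R = TypeVar("R")  # Red Team model type (hypothetical placeholder)
T = TypeVar("T")  # Target model type (hypothetical placeholder)


class AttackRecord(NamedTuple):
    prompt: str    # adversarial prompt produced by the Red Team model
    response: str  # Target model's reply to that prompt
    unsafe: bool   # verdict from a safety judge (an assumption of this sketch)


def run_iterations(
    initialize: Callable[[], Tuple[R, T]],
    attack: Callable[[R, T], List[AttackRecord]],
    optimize: Callable[[R, T, List[AttackRecord]], Tuple[R, T]],
    iterations: int,
) -> Tuple[R, T, List[float]]:
    """Run Initialization once, then alternate Attack and Adversarial Optimization."""
    # Stage 1: Initialization produces the starting Red Team and Target models.
    red_team, target = initialize()
    asr_history: List[float] = []
    for _ in range(iterations):
        # Stage 2: Attack collects prompt/response pairs with safety verdicts.
        records = attack(red_team, target)
        # Attack success rate: fraction of prompts that drew an unsafe response.
        asr = sum(r.unsafe for r in records) / len(records) if records else 0.0
        asr_history.append(asr)
        # Stage 3: Adversarial Optimization updates both models from the records.
        red_team, target = optimize(red_team, target, records)
    return red_team, target, asr_history
```

How each callback trains or queries a model is left open on purpose, since the abstract shown here does not specify it; the sketch only fixes the order of the stages and the fact that both models are updated in every iteration.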

Code & Models

Repositories

leondiao0427/seas
pytorch

Datasets

diaomuxi/SEAS
dataset · 34 dl

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections