SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models
Muxi Diao, Rumei Li, Shiyang Liu, Guogang Liao, Jingang Wang, Xunliang, Cai, Weiran Xu

TL;DR
SEAS is a novel iterative framework that enhances large language model security by self-evolving adversarial prompts, significantly improving robustness and reducing manual testing reliance.
Contribution
Introduces a self-evolving adversarial safety framework that iteratively refines attack and defense models to improve LLM security and safety.
Findings
Target model security level comparable to GPT-4 after iterations
Red Team attack success rate significantly increased
Framework reduces manual testing efforts
Abstract
As large language models (LLMs) continue to advance in capability and influence, ensuring their security and preventing harmful outputs has become crucial. A promising approach to address these concerns involves training models to automatically generate adversarial prompts for red teaming. However, the evolving subtlety of vulnerabilities in LLMs challenges the effectiveness of current adversarial methods, which struggle to specifically target and explore the weaknesses of these models. To tackle these challenges, we introduce the optimization framework, which enhances security by leveraging data generated by the model itself. SEAS operates through three iterative stages: Initialization, Attack, and Adversarial Optimization, refining both the Red Team and Target models to…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Natural Language Processing Techniques
MethodsAttention Is All You Need · Linear Layer · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections
