Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization
Xurui Li, Kaisong Song, Rui Zhu, Pin-Yu Chen, Haixu Tang

TL;DR
This paper introduces ACE-Safety, a co-evolutionary framework that improves LLM safety by jointly optimizing attack and defense models. It pairs a tree search that explores jailbreak strategies (GS-MCTS) with a curriculum-based group policy optimization procedure (AC-TGPO) that trains both models.
Contribution
The paper presents a co-evolutionary approach that combines GS-MCTS and AC-TGPO to optimize attack and defense models dynamically, in contrast to prior work that treats jailbreak attacks in isolation or relies on static defenses.
Findings
Outperforms existing attack and defense methods on multiple benchmarks.
Effectively uncovers vulnerabilities and enhances robustness of LLMs.
Demonstrates sustainable development of safer LLMs in real-world scenarios.
Abstract
Large Language Models (LLMs) have developed rapidly in web services, delivering unprecedented capabilities while amplifying societal risks. Existing works tend to focus on either isolated jailbreak attacks or static defenses, neglecting the dynamic interplay between evolving threats and safeguards in real-world web contexts. To mitigate these challenges, we propose ACE-Safety (Adversarial Co-Evolution for LLM Safety), a novel framework that jointly optimizes attack and defense models by seamlessly integrating two key innovative procedures: (1) Group-aware Strategy-guided Monte Carlo Tree Search (GS-MCTS), which efficiently explores jailbreak strategies to uncover vulnerabilities and generate diverse adversarial samples; (2) Adversarial Curriculum Tree-aware Group Policy Optimization (AC-TGPO), which jointly trains attack and defense LLMs with challenging samples via curriculum…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Information and Cyber Security · Advanced Malware Detection Techniques
