Model-Agnostic Lifelong LLM Safety via Externalized Attack-Defense Co-Evolution
Xiaozhe Zhang, Chaozhuo Li, Hui Liu, Shaocheng Yan, Bingyu Yan, Qiwei Ye, Haoliang Li

TL;DR
EvoSafety is a model-agnostic LLM safety framework that externalizes attack-defense co-evolution, enabling continued vulnerability probing and efficient, transferable safety improvements.
Contribution
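The paper presents EvoSafety, a safety framework that moves attack and defense into persistent, inspectable, and reusable external structures: an adversarial skill library that lets vulnerability probing continue after the attack policy saturates, and a lightweight auxiliary defense model that replaces model-specific safety fine-tuning.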
Findings
Achieves a 99.61% defense success rate in Guard mode.
Outperforms Qwen3Guard-8B by 14.13% with fewer parameters.
Maintains reasoning performance on benign queries.
Abstract
Large language models remain vulnerable to adversarial prompts that elicit harmful outputs. Existing safety paradigms typically couple red-teaming and post-training in a closed, policy-centric loop, causing attack discovery to suffer from rapid saturation and limiting the exposure of novel failure modes, while leaving defenses inefficient, rigid, and difficult to transfer across victim models. To this end, we propose EvoSafety, an LLM safety framework built around persistent, inspectable, and reusable external structures. For red teaming, EvoSafety equips the attack policy with an adversarial skill library, enabling continued vulnerability probing through simple library expansion after saturation, while supporting the evolution of adversarial vectors. For defense learning, EvoSafety replaces model-specific safety fine-tuning with a lightweight auxiliary defense model augmented with…
