MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety

Xiaoyu Wen; Zhida He; Han Qi; Ziyu Wan; Zhongtian Ma; Ying Wen; Tianhang Zheng; Xingcheng Xu; Chaochao Lu; Qiaosheng Zhang

arXiv:2602.01539·cs.AI·February 9, 2026

TL;DR

MAGIC is a dynamic multi-agent reinforcement learning framework that models LLM safety as an adversarial game. Training an attacker and a defender against each other lets the defender detect and refuse evolving, previously unseen prompt attacks, improving robustness.

Contribution

This work presents a novel co-evolving adversarial game framework for LLM safety, capturing dynamic attack-defense interactions and uncovering new attack strategies through reinforcement learning.

Findings

01 Outperforms existing defenses in defense success rate against adversarial prompts

02 Attacker develops novel combinatorial attack strategies

03 Framework provides theoretical safety guarantees

Abstract

Ensuring robust safety alignment is crucial for Large Language Models (LLMs), yet existing defenses often lag behind evolving adversarial attacks due to their reliance on static, pre-collected data distributions. In this paper, we introduce MAGIC, a novel multi-turn multi-agent reinforcement learning framework that formulates LLM safety alignment as an adversarial asymmetric game. Specifically, an attacker agent learns to iteratively rewrite original queries into deceptive prompts, while a defender agent simultaneously optimizes its policy to recognize and refuse such inputs. This dynamic process triggers a co-evolution, where the attacker's ever-changing strategies continuously uncover long-tail vulnerabilities, driving the defender to generalize to unseen attack patterns. Remarkably, we observe that the attacker, endowed with initial reasoning ability,…
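The attacker-defender loop described in the abstract can be illustrated with a toy sketch. This is not the paper's method: MAGIC trains LLM agents over multiple turns, whereas the code below reduces the game to a bandit in which the attacker picks one of three rewrite strategies and the defender holds one refusal probability per strategy. The strategy labels, the starting logits, the reward values, and the function name `co_evolve` are all assumptions made for illustration.

```python
import math
import random

# Hypothetical rewrite strategies; the labels are illustrative only.
STRATEGIES = ["role_play", "hypothetical_framing", "payload_splitting"]
# Assumed starting refusal logits: the defender begins weak on the first strategy.
INITIAL_REFUSAL_LOGITS = [-2.0, 0.0, 2.0]


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def co_evolve(rounds=2000, lr=0.1, seed=0):
    """Run a toy attacker-defender game with REINFORCE-style updates.

    Returns (attacker strategy probabilities, defender refusal probabilities).
    """
    rng = random.Random(seed)
    attacker_logits = [0.0] * len(STRATEGIES)
    defender_logits = list(INITIAL_REFUSAL_LOGITS)

    for _ in range(rounds):
        # Attacker samples a rewrite strategy from its current policy.
        probs = softmax(attacker_logits)
        k = rng.choices(range(len(STRATEGIES)), weights=probs)[0]

        # Defender refuses with a probability specific to that strategy.
        p_refuse = sigmoid(defender_logits[k])
        refused = rng.random() < p_refuse

        # Attacker update: reward 1 on success, 0 on refusal, baseline 0.5.
        advantage = (0.0 if refused else 1.0) - 0.5
        for j in range(len(STRATEGIES)):
            grad_log_pi = (1.0 if j == k else 0.0) - probs[j]
            attacker_logits[j] += lr * advantage * grad_log_pi

        # Defender update: reward +1 for refusing, -1 for complying.
        # Every prompt in this toy is harmful, so refusing is always correct.
        action = 1.0 if refused else 0.0
        reward = 1.0 if refused else -1.0
        defender_logits[k] += lr * reward * (action - p_refuse)

    return softmax(attacker_logits), [sigmoid(x) for x in defender_logits]
```

Calling `co_evolve()` returns the attacker's final strategy distribution and the defender's per-strategy refusal probabilities. The attacker drifts toward whichever strategy the defender currently handles worst, and each visit pushes the defender's refusal logit for that strategy upward. Because the toy contains no benign queries, nothing penalizes over-refusal; a more faithful sketch would need to reward the defender for answering harmless prompts.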

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)