Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization
Huilin Zhou, Jian Zhao, Yilu Zhong, Zhen Liang, Xiuyuan Chen, Yuchen Yuan, Tianle Zhang, Chi Zhang, Lan Zhang, Xuelong Li

TL;DR
Metis is a self-evolving policy-optimization framework for jailbreaking large language models that aims to be both more effective and more interpretable, and it reportedly outperforms existing methods in attack success rate and token efficiency.
Contribution
It reformulates jailbreaking as inference-time policy optimization within an adversarial POMDP, incorporating a self-evolving metacognitive loop for causal diagnosis and feedback-driven policy refinement.
Findings
Achieves 89.2% average attack success rate across 10 models.
Reduces token costs by up to 11.4x compared to baselines.
Finds that target models are vulnerable to internally steered, closed-loop reasoning trajectories.
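To make the two headline metrics concrete, the following is a minimal sketch of how an average attack success rate (ASR) and a token-cost ratio could be computed. It assumes a macro average over models (each model weighted equally), which this summary does not confirm. The function names and all counts are hypothetical placeholders, not figures from the paper.

```python
from statistics import mean


def attack_success_rate(successes: int, attempts: int) -> float:
    """Return ASR as a percentage: successful attacks / attempted attacks."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * successes / attempts


def macro_average_asr(per_model: dict[str, tuple[int, int]]) -> float:
    """Average per-model ASR, weighting every model equally (an assumption)."""
    return mean(attack_success_rate(s, n) for s, n in per_model.values())


def token_cost_ratio(baseline_tokens: float, method_tokens: float) -> float:
    """Return how many times more tokens the baseline uses than the method."""
    if method_tokens <= 0:
        raise ValueError("method_tokens must be positive")
    return baseline_tokens / method_tokens


# Hypothetical placeholder counts: (successful attacks, attempted attacks).
example_counts = {"model_a": (45, 50), "model_b": (40, 50)}

average_asr = macro_average_asr(example_counts)       # mean of 90.0 and 80.0
cost_ratio = token_cost_ratio(3000, 1000)             # baseline uses 3x the tokens
```

Under this reading, a figure such as "up to 11.4x" would be the largest such ratio observed over the compared baselines, though the summary does not state which baseline or setting produces it.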
Abstract
Red teaming is critical for uncovering vulnerabilities in Large Language Models (LLMs). While automated methods have improved scalability, existing approaches often rely on static heuristics or stochastic search, rendering them brittle against advanced safety alignment. To address this, we introduce Metis, a framework that reformulates jailbreaking as inference-time policy optimization within an adversarial Partially Observable Markov Decision Process (POMDP). Metis employs a self-evolving metacognitive loop to perform causal diagnosis of a target's defense logic and leverages structured feedback as a semantic gradient to refine its policy, offering enhanced interpretability through transparent reasoning traces. Extensive evaluations across 10 diverse models demonstrate that Metis achieves the strongest average Attack Success Rate (ASR) among compared methods at 89.2%, maintaining high…
