Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization

Huilin Zhou; Jian Zhao; Yilu Zhong; Zhen Liang; Xiuyuan Chen; Yuchen Yuan; Tianle Zhang; Chi Zhang; Lan Zhang; Xuelong Li

arXiv:2605.10067 · cs.LG · May 22, 2026

TL;DR

Metis is a self-evolving policy-optimization framework for jailbreaking large language models. It reports a higher attack success rate and lower token cost than the methods it is compared against, and it exposes its reasoning traces for interpretability.

Contribution

Metis reformulates jailbreaking as inference-time policy optimization within an adversarial Partially Observable Markov Decision Process (POMDP). A self-evolving metacognitive loop performs causal diagnosis of the target's defense logic and uses structured feedback to refine the policy.
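For readers unfamiliar with the term, the textbook definition of a POMDP and its belief update are given below. This is the standard formulation, not the paper's own notation, which this summary does not reproduce. Reading the hidden state $s$ as the target model's defense logic, the action $a$ as the attacker's move, and the observation $o$ as the target's response is an assumption made here for orientation only.

```latex
% Standard POMDP tuple (textbook definition, not taken from the paper)
\[
  \mathcal{M} = (\mathcal{S}, \mathcal{A}, T, R, \Omega, O, \gamma)
\]
% S: hidden states, A: actions, Omega: observations,
% T(s' | s, a): transition model, O(o | s', a): observation model,
% R(s, a): reward, gamma in [0, 1): discount factor.

% Bayesian belief update after taking action a and observing o
\[
  b'(s') = \frac{O(o \mid s', a) \sum_{s \in \mathcal{S}} T(s' \mid s, a)\, b(s)}
                {\sum_{s'' \in \mathcal{S}} O(o \mid s'', a) \sum_{s \in \mathcal{S}} T(s'' \mid s, a)\, b(s)}
\]
```

Because the state is never observed directly, the agent acts on the belief $b$ rather than on $s$, which is what makes the setting partially observable.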

Findings

01. Achieves an 89.2% average Attack Success Rate (ASR) across 10 models.

02. Reduces token costs by up to 11.4x compared to baselines.

03. Indicates that target models remain vulnerable to internally steered, closed-loop reasoning trajectories.
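The two headline metrics above are simple ratios. The sketch below shows one plausible way to compute them. It assumes the average ASR is an unweighted mean of per-model rates and that the cost reduction is baseline tokens divided by method tokens; this summary does not state the paper's exact definitions. All function names and numbers are hypothetical and chosen only to illustrate the arithmetic.

```python
from statistics import mean


def attack_success_rate(successes: int, attempts: int) -> float:
    """Return the success rate for one model as a percentage."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * successes / attempts


def average_asr(results: dict[str, tuple[int, int]]) -> float:
    """Unweighted mean of per-model ASR (an assumed aggregation scheme).

    `results` maps a model name to (successes, attempts).
    """
    return mean(attack_success_rate(s, a) for s, a in results.values())


def token_cost_reduction(baseline_tokens: float, method_tokens: float) -> float:
    """Return how many times cheaper the method is than the baseline."""
    if method_tokens <= 0:
        raise ValueError("method_tokens must be positive")
    return baseline_tokens / method_tokens


# Hypothetical illustration data, not results from the paper.
example_results = {"model_a": (45, 50), "model_b": (40, 50)}
example_avg = average_asr(example_results)            # mean of 90.0 and 80.0
example_ratio = token_cost_reduction(11400.0, 1000.0)  # baseline / method
```

Under these assumptions, a figure such as "up to 11.4x" would correspond to the largest ratio observed across the compared baselines, not an average.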

Abstract

Red teaming is critical for uncovering vulnerabilities in Large Language Models (LLMs). While automated methods have improved scalability, existing approaches often rely on static heuristics or stochastic search, rendering them brittle against advanced safety alignment. To address this, we introduce Metis, a framework that reformulates jailbreaking as inference-time policy optimization within an adversarial Partially Observable Markov Decision Process (POMDP). Metis employs a self-evolving metacognitive loop to perform causal diagnosis of a target's defense logic and leverages structured feedback as a semantic gradient to refine its policy, offering enhanced interpretability through transparent reasoning traces. Extensive evaluations across 10 diverse models demonstrate that Metis achieves the strongest average Attack Success Rate (ASR) among compared methods at 89.2%, maintaining high…
