"To Survive, I Must Defect": Jailbreaking LLMs via the Game-Theory Scenarios

Zhen Sun; Zongmin Zhang; Deqi Liang; Han Sun; Yule Liu; Yun Shen; Xiangshan Gao; Yilong Yang; Shuai Liu; Yutao Yue; Xinlei He

arXiv:2511.16278 · cs.CR · November 21, 2025

TL;DR

This paper introduces a scalable game-theoretic framework for black-box jailbreak attacks on language models, demonstrating high success rates and robustness across multiple scenarios and real-world applications.

Contribution

The paper formalizes the attacker's interaction with a safety-aligned LLM as a finite-horizon, early-stoppable sequential stochastic game, enabling scalable jailbreak strategies that outperform prior heuristic-based methods.

Findings

01. Achieves over 95% attack success rate on multiple LLMs

02. Maintains high efficiency and generalization across scenarios

03. Effective in real-world LLM applications and safety monitoring

Abstract

As LLMs become more common, non-expert users can pose risks, prompting extensive research into jailbreak attacks. However, most existing black-box jailbreak attacks rely on hand-crafted heuristics or narrow search spaces, which limits their scalability. In contrast to these prior attacks, we propose Game-Theory Attack (GTA), a scalable black-box jailbreak framework. Concretely, we formalize the attacker's interaction with safety-aligned LLMs as a finite-horizon, early-stoppable sequential stochastic game, and reparameterize the LLM's randomized outputs via quantal response. Building on this, we introduce a behavioral conjecture, the "template-over-safety flip": by reshaping the LLM's effective objective through game-theoretic scenarios, the LLM's original preference for safety may give way to maximizing scenario payoffs within the template, which weakens safety constraints in specific contexts. We validate this…
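The abstract names quantal response without giving a formula. As a rough illustration only, the sketch below uses the standard logit form from behavioral game theory, in which an action is chosen with probability proportional to exp(rationality × payoff). Whether the paper uses exactly this parameterization is an assumption, and the function name, the `rationality` parameter, and the payoff values are all hypothetical.

```python
import math

def quantal_response(payoffs, rationality):
    """Logit quantal response (assumed form, not taken from the paper).

    Returns a probability for each action, proportional to
    exp(rationality * payoff).
    """
    scaled = [rationality * p for p in payoffs]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical payoffs for two abstract actions.
uniform = quantal_response([1.0, 2.0], 0.0)   # rationality 0: payoffs are ignored
sharp = quantal_response([1.0, 2.0], 10.0)    # high rationality: near-deterministic
```

With `rationality` at zero the choice is uniformly random, and as it grows the distribution concentrates on the highest-payoff action, which is how a single parameter can model a player whose outputs are randomized rather than strictly optimal.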

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Information and Cyber Security