Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models
Zijian Chen, Wenjun Zhang, Guangtao Zhai

TL;DR
This paper introduces Squid Game, a dynamic, adversarial evaluation environment for large language models that tests their abilities under pressure and resource constraints, revealing insights into model behavior and evaluation robustness.
Contribution
It presents a novel interactive, adversarial benchmarking framework for LLMs, addressing static benchmark limitations and exploring model performance in dynamic, resource-limited scenarios.
Findings
Performance shows a generational phase transition.
Some models use speculative shortcuts to win.
Dynamic evaluation complements static benchmarks.
Abstract
The potential data contamination issue in contemporary large language model (LLM) benchmarks presents a fundamental challenge to establishing trustworthy evaluation frameworks. Meanwhile, these benchmarks predominantly assume benign, resource-rich settings, leaving the behavior of LLMs under pressure unexplored. In this paper, we introduce Squid Game, a dynamic and adversarial evaluation environment with resource-constrained and asymmetric-information settings, designed to evaluate LLMs through interactive gameplay against other LLM opponents. Squid Game consists of six elimination-style levels, focusing on multi-faceted abilities, including instruction-following, code, reasoning, planning, and safety alignment. We evaluate over 50 LLMs on Squid Game, presenting the largest behavioral evaluation study of general LLMs on dynamic adversarial scenarios. We observe a clear generational…
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
* Devising novel dynamic evaluation methods is an important area, given that training data contamination and inaccessibility of the training data of many current models make it difficult to assess which abilities result from memorization and which ones result from generalization abilities.
* The paper evaluates a wide range of LLMs.
* My general impression is that the authors focused most of their energy on coming up with tasks that could be applied to LLMs and that mimic tasks on the Netflix show "Squid Game". While it can be beneficial to take inspiration from very different domains, the current work fails to motivate why these rounds are reasonable.
* It is unclear what this benchmark is supposed to highlight. Given the multitude of different LLMs, many of which have been optimized for different aspects, it seems misguided
- The core idea of using a dynamic, adversarial tournament for LLM evaluation is highly novel, creative, and engaging.
- The paper conveys a sense of enthusiasm for the project, suggesting a highly motivated research effort.
* **Missing Summative Ranking:** The paper's "Squid Game" theme creates a strong expectation for a final tournament outcome. A significant weakness is the lack of a clear figure or table showing the **final ranking** of all evaluated models, or a leaderboard aggregated over the 20 independent runs. Presenting a clear winner and runners-up would be a crucial and satisfying addition to complete the paper's narrative.
* **Insufficient Qualitative Analysis:** The paper is currently focused on quanti
- Squid Game is an interesting idea with strong motivations. For example, many static benchmarks are nearing saturation with top models, and data contamination can inflate scores. Squid Game directly tackles this by introducing information asymmetry and dynamic adversarial evaluation. The benchmark shifts the question from “What does the model know?” to “How does the model act under uncertainty and competition?”. This perspective is fresh.
- Squid Game’s six levels are carefully chosen to exerci
- The Squid Game benchmark introduces a fairly complex evaluation setup with multiple games and roles. The paper would benefit from clearer descriptions of each game’s rules and scoring. For example, how exactly is a “win” determined in the Tug-of-War debate, or what constitutes success in the Marbles game? A more explicit explanation of the evaluation criteria for each level would improve clarity.
- By design, this benchmark is resource-heavy. Running head-to-head evaluations on 50+ models with
- Innovative Evaluation Paradigm: Most existing LLM benchmarks are static and benign, while this work’s SQUID GAME pioneers an "elimination + dynamic adversarial" framework, filling the gap in dynamic adversarial LLM evaluation.
- Large-Scale Experiments: It evaluates 52 LLMs (28 proprietary, such as GPT-5 and Gemini 2.5 Pro; 24 open-source, such as Qwen3 and DeepSeek), the largest behavioral study in dynamic adversarial scenarios to date. It also identifies model exploitation of evaluation loopholes, providing
- Bias from Linear Elimination Order: SQUID GAME adopts a strict linear elimination sequence, where participants in subsequent levels are entirely determined by the survivors of the previous level. While this design simulates dynamic competition, it may introduce biases in evaluating the comprehensive capabilities of LLMs, specifically in two aspects:
  - Early levels focus on basic capabilities, which may eliminate models that perform poorly in basic tasks but excel in advanced capabilities. Thi
1. Recasting LLM evaluation as an elimination-style, adversarial “game” offers a fresh, metaphor-driven approach to stress-testing model capabilities.
2. Experiments with 52 models provide rich comparative data, allowing broad insights into model behavior under dynamic constraints.
3. The paper is well-structured, figures are effective, and the motivation and empirical sections are easy to follow.
1. Although the paper introduces six interactive games, it provides insufficient formalization or reproducible implementation details, making it difficult for others to replicate or validate the experiments.
2. Correlation plots with existing benchmarks (e.g., LIVEBENCH, CHATBOT ARENA) remain anecdotal. The paper lacks formal statistical analysis (e.g., significance testing, regression, controlled ablations) and does not report error bars or variance beyond simple averages.
3. The paper provid
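Several reviews above ask for a leaderboard aggregated over the independent runs and for variance to be reported alongside averages. The sketch below shows one way such an aggregation could look. It is not the authors' method: the function name `leaderboard`, the scoring choice (furthest level reached per run, 0 to 6), and the toy data are all assumptions made for illustration, and none of the numbers come from the paper.

```python
from statistics import mean, stdev

def leaderboard(runs):
    """Aggregate per-run results into a ranked table.

    `runs` is a list of dicts, one per independent run, mapping a model
    name to the furthest level it reached in that run (assumed scoring,
    0 = eliminated at the first level, 6 = survived all six levels).
    Returns rows of (model, mean level, sample std dev, number of runs).
    """
    models = sorted({name for run in runs for name in run})
    rows = []
    for name in models:
        levels = [run[name] for run in runs if name in run]
        # Sample standard deviation needs at least two observations.
        spread = stdev(levels) if len(levels) > 1 else 0.0
        rows.append((name, mean(levels), spread, len(levels)))
    # Rank by mean level (descending); break ties alphabetically.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows

# Hypothetical toy data for three placeholder models over three runs.
toy_runs = [
    {"model_a": 6, "model_b": 3, "model_c": 1},
    {"model_a": 5, "model_b": 4, "model_c": 1},
    {"model_a": 6, "model_b": 2, "model_c": 2},
]

for rank, (name, avg, sd, n) in enumerate(leaderboard(toy_runs), start=1):
    print(f"{rank}. {name}: mean level {avg:.2f} +/- {sd:.2f} over {n} runs")
```

Reporting the spread next to the mean makes it visible when two models' rankings are separated by less than their run-to-run variation, which is the concern the reviewers raise about simple averages.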
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Natural Language Processing Techniques
