Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models

Zijian Chen; Wenjun Zhang; Guangtao Zhai

arXiv:2511.10691·cs.CL·February 2, 2026


Open Access 5 Reviews

TL;DR

This paper introduces Squid Game, a dynamic, adversarial evaluation environment for large language models that tests their abilities under pressure and resource constraints, revealing insights into model behavior and evaluation robustness.

Contribution

It presents a novel interactive, adversarial benchmarking framework for LLMs, addressing static benchmark limitations and exploring model performance in dynamic, resource-limited scenarios.

Findings

01. Performance shows a generational phase transition.

02. Some models use speculative shortcuts to win.

03. Dynamic evaluation complements static benchmarks.

Abstract

Potential data contamination in contemporary large language model (LLM) benchmarks presents a fundamental challenge to establishing trustworthy evaluation frameworks. Meanwhile, these benchmarks predominantly assume benign, resource-rich settings, leaving the behavior of LLMs under pressure unexplored. In this paper, we introduce Squid Game, a dynamic and adversarial evaluation environment with resource-constrained and asymmetric-information settings, designed to evaluate LLMs through interactive gameplay against other LLM opponents. Squid Game consists of six elimination-style levels that focus on multi-faceted abilities, including instruction-following, code, reasoning, planning, and safety alignment. We evaluate over 50 LLMs on Squid Game, presenting the largest behavioral evaluation study of general LLMs on dynamic adversarial scenarios. We observe a clear generational…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

* Devising novel dynamic evaluation methods is an important area, given that training data contamination and the inaccessibility of the training data of many current models make it difficult to assess which abilities result from memorization and which result from generalization.
* The paper evaluates a wide range of LLMs.

Weaknesses

* My general impression is that the authors focused most of their energy on coming up with tasks that could be applied to LLMs and that mimic the games in the Netflix show "Squid Game". While it can be beneficial to take inspiration from very different domains, the current work fails to motivate why these rounds are reasonable.
* It is unclear what this benchmark is supposed to highlight. Given the multitude of different LLMs, many of which have been optimized for different aspects, it seems misguided

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The core idea of using a dynamic, adversarial tournament for LLM evaluation is highly novel, creative, and engaging.
- The paper conveys a sense of enthusiasm for the project, suggesting a highly motivated research effort.

Weaknesses

* **Missing Summative Ranking:** The paper's "Squid Game" theme creates a strong expectation for a final tournament outcome. A significant weakness is the lack of a clear figure or table showing the **final ranking** of all evaluated models, or a leaderboard aggregated over the 20 independent runs. Presenting a clear winner and runners-up would be a crucial and satisfying addition to complete the paper's narrative.
* **Insufficient Qualitative Analysis:** The paper is currently focused on quanti
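
As a rough illustration of the leaderboard this reviewer asks for, the sketch below aggregates per-run results into a single ranking. It is not the authors' scoring scheme: the input format (one dictionary per run, mapping a model name to the number of levels it cleared), the function name `aggregate_leaderboard`, and the model names and numbers are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def aggregate_leaderboard(runs):
    """Rank models by mean levels cleared across independent runs.

    `runs` is an assumed format: a list of dicts, one per run, each mapping
    a model name to the number of levels that model cleared in that run.
    """
    per_model = defaultdict(list)
    for run in runs:
        for model, levels_cleared in run.items():
            per_model[model].append(levels_cleared)

    rows = []
    for model, scores in per_model.items():
        # Sample standard deviation needs at least two runs.
        spread = stdev(scores) if len(scores) > 1 else 0.0
        rows.append((model, mean(scores), spread, len(scores)))

    # Highest mean first; ties broken alphabetically for a stable order.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Hypothetical data: three runs, three placeholder models.
runs = [
    {"model_a": 6, "model_b": 3, "model_c": 1},
    {"model_a": 5, "model_b": 4, "model_c": 1},
    {"model_a": 6, "model_b": 2, "model_c": 2},
]
leaderboard = aggregate_leaderboard(runs)
for model, avg, spread, n in leaderboard:
    print(f"{model}: mean={avg:.2f} sd={spread:.2f} runs={n}")
```

Reporting the spread next to the mean also shows how stable each model's placement is from run to run.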

Reviewer 03 · Rating 6 · Confidence 2

Strengths

- Squid Game is an interesting idea with strong motivations. For example, many static benchmarks are nearing saturation with top models, and data contamination can inflate scores. Squid Game directly tackles this by introducing information asymmetry and dynamic adversarial evaluation. The benchmark shifts the question from “What does the model know?” to “How does the model act under uncertainty and competition?”. This perspective is fresh.
- Squid Game’s six levels are carefully chosen to exerci

Weaknesses

- The Squid Game benchmark introduces a fairly complex evaluation setup with multiple games and roles. The paper would benefit from clearer descriptions of each game’s rules and scoring. For example, how exactly is a “win” determined in the Tug-of-War debate, or what constitutes success in the Marbles game? A more explicit explanation of the evaluation criteria for each level would improve clarity.
- By design, this benchmark is resource-heavy. Running head-to-head evaluations on 50+ models with

Reviewer 04 · Rating 4 · Confidence 4

Strengths

- Innovative Evaluation Paradigm: Most existing LLM benchmarks are static and benign, while this work’s SQUID GAME pioneers an "elimination + dynamic adversarial" framework, filling the gap in dynamic adversarial LLM evaluation.
- Large-Scale Experiments: It evaluates 52 LLMs (28 proprietary like GPT-5/Gemini 2.5 Pro; 24 open-source like Qwen3/DeepSeek), the largest behavioral study in dynamic adversarial scenarios to date. It also identifies model exploitation of evaluation loopholes, providing

Weaknesses

- Bias from Linear Elimination Order: SQUID GAME adopts a strict linear elimination sequence, where participants in subsequent levels are entirely determined by the survivors of the previous level. While this design simulates dynamic competition, it may introduce biases in evaluating the comprehensive capabilities of LLMs, specifically in two aspects:
  - Early levels focus on basic capabilities, which may eliminate models that perform poorly in basic tasks but excel in advanced capabilities. Thi

Reviewer 05 · Rating 2 · Confidence 5

Strengths

1. Recasting LLM evaluation as an elimination-style, adversarial “game” offers a fresh, metaphor-driven approach to stress-testing model capabilities.
2. Experiments with 52 models provide rich comparative data, allowing broad insights into model behavior under dynamic constraints.
3. The paper is well-structured, figures are effective, and the motivation and empirical sections are easy to follow.

Weaknesses

1. Although the paper introduces six interactive games, it provides insufficient formalization or reproducible implementation details, making it difficult for others to replicate or validate the experiments.
2. Correlation plots with existing benchmarks (e.g., LIVEBENCH, CHATBOT ARENA) remain anecdotal. The paper lacks formal statistical analysis (e.g., significance testing, regression, controlled ablations) and does not report error bars or variance beyond simple averages.
3. The paper provid
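
To make the statistical point in item 2 concrete, here is a minimal sketch of one way to go beyond a correlation plot: a Spearman rank correlation between benchmark scores, with a percentile-bootstrap confidence interval over models. This is an assumption about what such an analysis might look like, not something the paper reports. The helper names and the two score lists are hypothetical placeholders, not real results.

```python
import random


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns None if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return None
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for Spearman's rho, resampling models."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([x[i] for i in idx], [y[i] for i in idx])
        if rho is not None:  # skip degenerate resamples with no variance
            stats.append(rho)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi


# Hypothetical per-model scores, for illustration only.
dynamic_scores = [5.7, 3.0, 1.3, 4.2, 2.1, 0.5]
static_scores = [88, 70, 61, 65, 74, 40]

rho = spearman(dynamic_scores, static_scores)
ci_low, ci_high = bootstrap_ci(dynamic_scores, static_scores)
print(f"Spearman rho = {rho:.3f}, 95% bootstrap CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

With only a handful of models the interval will be wide, which is itself useful information when judging how much weight a correlation plot can bear.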


Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Natural Language Processing Techniques