Is Your LLM Really Mastering the Concept? A Multi-Agent Benchmark

Shuhang Xu; Weijian Deng; Yixuan Zhou; Fangwei Zhong

arXiv:2505.17512·cs.AI·February 12, 2026

TL;DR

This paper introduces CK-Arena, a dynamic multi-agent benchmark using a social deduction game to evaluate whether large language models truly understand concepts beyond surface patterns.

Contribution

It presents a novel interactive benchmark that probes fine-grained conceptual understanding in LLMs through multi-agent social deduction tasks.

Findings

01 Conceptual understanding varies significantly across models and categories.

02 Performance is not strictly correlated with overall model capability.

03 CK-Arena enables detailed diagnostic analysis of semantic comprehension.

Abstract

Concepts serve as fundamental abstractions that support human reasoning and categorization. However, it remains unclear whether large language models truly capture such conceptual structures or primarily rely on surface-level pattern memorization. Existing benchmarks are largely static and fact-oriented, which limits their ability to probe fine-grained semantic understanding and makes them vulnerable to data leakage and overfitting. To address this limitation, we introduce CK-Arena, a dynamic benchmark for conceptual knowledge evaluation based on a multi-agent social deduction game, namely the Undercover game. In this setting, LLM-based agents are assigned subtly different concept words and must describe, distinguish, and infer conceptual properties from others' statements. Model performance is evaluated through both game-level outcomes and the semantic quality of generated…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

This paper provides a dynamic benchmark that evaluates the conceptualization capability of LLMs in an arena-like setting built on the "Undercover" game, offering a new perspective for ranking the conceptualization capability of LLMs. The metrics and checking process at each round are relatively fair and comprehensive, making the results convincing. The analysis of the results is thorough, covering both the raw performance and the Elo rating, as well as the qualitative analysis of the distribution o

Weaknesses

Lack of fine-grained case studies: since the benchmark is based on the "Undercover" game, the strategies of different LLMs matter, yet they are not explicitly discussed; a few case studies could illustrate them. The evaluation also seems somewhat costly, since it requires multiple rounds with multiple LLM agents. Although the authors provide some methods to mitigate the cost, it remains time-consuming.
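The Elo rating mentioned in this review can be illustrated with a minimal sketch. The snippet below shows the standard two-player Elo update only; the page does not give the paper's exact formulation for multi-agent games, so this is not a description of CK-Arena's method. The K-factor of 32, the 400-point scale, and the function names are illustrative assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k (assumed value) controls how fast ratings move.
    """
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - exp_b)
    return new_a, new_b


if __name__ == "__main__":
    # Two equally rated players; A wins.
    print(update_elo(1500.0, 1500.0, 1.0))
```

With equal starting ratings the expected score is 0.5, so a win moves the winner up by half the K-factor and the loser down by the same amount; the total rating in the pool is conserved.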

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper explores an important and interesting topic: evaluating LLMs’ conceptual knowledge. Assessing whether models truly understand conceptual knowledge is fundamental to understanding world knowledge, which provides valuable insights for the broader research community.
2. The use of the *Undercover Game* paradigm is interesting. By introducing dynamic evaluation through model interaction, the approach effectively mitigates issues such as benchmark leakage of static evaluations.
3. The ex

Weaknesses

1. The core evaluation module, i.e., the Undercover Game, is adapted from existing work. While applying this framework to a new domain or evaluation is indeed a meaningful contribution, it may somewhat limit the paper’s technical novelty.
2. The paper evaluates a total of 529 English concept pairs, including 220 concrete noun pairs, 100 abstract noun pairs, 109 adverb pairs, and 100 verb pairs. As an evaluation and benchmark work, it would be helpful to provide some validation regarding whether

Reviewer 03 · Rating 2 · Confidence 3

Strengths

- Repurposes existing multi-agent game (Undercover) to evaluate conceptual understanding rather than just strategic gameplay
- Provides structured evaluation framework with systematic metrics (Novelty, Reasonableness, Relevance)
- Uses t-SNE semantic dispersion as proxy for conceptual depth (decent visualization approach, not novel)
- Comprehensive evaluation across 14 LLMs with standardized Elo rating system
- Some interesting cross-category performance variations (e.g., Claude's verb/noun fli

Weaknesses

- Incremental contribution over existing work: The paper repurposes the Undercover game framework (from previous work). While the stated goal is evaluating "conceptual understanding" rather than "strategy," any language game involving concept description inherently tests conceptual knowledge. The paper does not provide sufficient justification for why this reframing constitutes a distinct contribution beyond running additional experiments with different evaluation metrics.
- Tension between gam

Code & Models

Datasets

Xushuhaha/CK-Arena
dataset· 84 dl
