Fine-Grained and Thematic Evaluation of LLMs in Social Deduction Game
Byungjun Kim, Dayeon Seo, Minju Kim, Bugeun Kim

TL;DR
This paper introduces a detailed evaluation framework for large language models in social deduction games, highlighting their reasoning failures and providing more nuanced insights than previous coarse metrics.
Contribution
It presents six fine-grained metrics and a thematic analysis approach to better evaluate LLMs' performance in social deduction tasks.
Findings
Identified four major reasoning failures in LLMs
Developed six detailed evaluation metrics
Highlighted limitations of previous coarse-grained assessments
Abstract
Recent studies have investigated whether large language models (LLMs) can support obscured communication, which is characterized by core aspects such as inferring subtext and evading suspicions. To conduct the investigation, researchers have used social deduction games (SDGs) as their experimental environment, in which players conceal and infer specific information. However, prior work has often overlooked how LLMs should be evaluated in such settings. Specifically, we point out two limitations of the evaluation methods used in prior work. First, the metrics used in prior studies are coarse-grained, as they are based on overall game outcomes that often fail to capture event-level behaviors. Second, error analyses have lacked structured methodologies capable of producing insights that meaningfully support evaluation outcomes. To address these limitations, we propose a microscopic and systematic…
Taxonomy
Topics: Digital Games and Media
