Reasoning Capabilities of Large Language Models. Lessons Learned from General Game Playing
Maciej Świechowski, Adam Żychowski, Jacek Mańdziuk

TL;DR
This paper evaluates the reasoning abilities of large language models within rule-based environments using General Game Playing tasks, revealing their strengths, limitations, and common errors in formal reasoning.
Contribution
It introduces a comprehensive analysis of LLM reasoning in formal, rule-based settings, highlighting performance patterns, structural game features, and reasoning errors.
Findings
Models perform well across most tasks
Performance decreases with longer game horizons
Common errors include hallucinated rules and syntactic mistakes
Abstract
This paper examines the reasoning capabilities of Large Language Models (LLMs) from a novel perspective, focusing on their ability to operate within formally specified, rule-governed environments. We evaluate four LLMs (Gemini 2.5 Pro and Flash variants, Llama 3.3 70B and GPT-OSS 120B) on a suite of forward-simulation tasks, including next / multistep state formulation and legal action generation, across a diverse set of reasoning problems illustrated through General Game Playing (GGP) game instances. Beyond reporting instance-level performance, we characterize games based on 40 structural features and analyze correlations between these features and LLM performance. Furthermore, we investigate the effects of various game obfuscations to assess the role of linguistic semantics in game definitions and the impact of potential prior exposure of LLMs to specific games during training. The…
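To make the three forward-simulation tasks named in the abstract concrete, the following is a minimal sketch on a toy two-player counting game. It is an illustration only: the game, the `State` type, and the functions `legal_actions`, `next_state`, `simulate`, and `is_correct` are hypothetical names chosen here, not the paper's benchmark, its game descriptions, or its evaluation code.

```python
from typing import List, NamedTuple


class State(NamedTuple):
    """Hypothetical toy game: players alternately remove 1 or 2 tokens."""
    counter: int  # tokens remaining
    player: int   # index of the player to move (0 or 1)


def legal_actions(state: State) -> List[int]:
    """Legal action generation: every move allowed in this state."""
    return [k for k in (1, 2) if k <= state.counter]


def next_state(state: State, action: int) -> State:
    """Next-state task: apply one legal action and pass the turn."""
    if action not in legal_actions(state):
        raise ValueError(f"illegal action {action} in {state}")
    return State(state.counter - action, 1 - state.player)


def simulate(state: State, actions: List[int]) -> State:
    """Multistep task: apply a sequence of actions in order."""
    for action in actions:
        state = next_state(state, action)
    return state


def is_correct(predicted: State, start: State, actions: List[int]) -> bool:
    """Assumed scoring rule: exact match against the reference simulator."""
    return predicted == simulate(start, actions)
```

In this sketch, a model's answer to a multistep question would be parsed into a `State` and compared with `simulate(start, actions)`; a longer `actions` list corresponds to a longer horizon, which is the setting where the Findings above report decreasing performance.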
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Natural Language Processing Techniques
