KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation

Jiajun Shi; Jian Yang; Jiaheng Liu; Xingyuan Bu; Jiangjie Chen; Junting Zhou; Kaijing Ma; Zhoufutu Wen; Bingli Wang; Yancheng He; Liang Song; Hualei Zhu; Shilong Li; Xingjian Wang; Wei Zhang; Ruibin Yuan; Yifan Yao; Wenjun Yang; Yunli Wang; Siyuan Fang; Siyu Yuan; Qianyu He; Xiangru Tang; Yingshui Tan; Wangchunshu Zhou; Zhaoxiang Zhang; Zhoujun Li; Wenhao Huang; Ge Zhang

arXiv:2505.14552 · cs.CL · May 22, 2025

TL;DR

KORGym is a versatile, interactive evaluation platform with over fifty games designed to assess and analyze the reasoning capabilities of large language and vision models in complex, multi-turn scenarios.

Contribution

We introduce KORGym, a novel dynamic evaluation platform with diverse interactive games for comprehensive reasoning assessment of LLMs and VLMs.

Findings

01 Closed-source models outperform open-source ones.

02 Model performance varies with modality and reasoning strategies.

03 Reinforcement learning impacts model reasoning abilities.

Abstract

Recent advancements in large language models (LLMs) underscore the need for more comprehensive evaluation methods to accurately assess their reasoning capabilities. Existing benchmarks are often domain-specific and thus cannot fully capture an LLM's general reasoning potential. To address this limitation, we introduce the Knowledge Orthogonal Reasoning Gymnasium (KORGym), a dynamic evaluation platform inspired by KOR-Bench and Gymnasium. KORGym offers over fifty games in either textual or visual formats and supports interactive, multi-turn assessments with reinforcement learning scenarios. Using KORGym, we conduct extensive experiments on 19 LLMs and 8 VLMs, revealing consistent reasoning patterns within model families and demonstrating the superior performance of closed-source models. Further analysis examines the effects of modality, reasoning strategies, reinforcement learning…
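To make the idea of an interactive, multi-turn game evaluation concrete, here is a minimal Python sketch. It is not KORGym's API: `GuessNumberGame`, `binary_search_policy`, `play_episode`, and `evaluate` are hypothetical names invented for illustration, the game is a toy, and a scripted policy stands in for a model call. It only shows the general shape of a loop in which a player receives feedback each turn and is scored at the end of an episode.

```python
import random


class GuessNumberGame:
    """Hypothetical toy game: find a hidden integer in [low, high]."""

    def __init__(self, seed, low=1, high=100, max_turns=10):
        self.low, self.high, self.max_turns = low, high, max_turns
        # A per-instance seed makes every episode reproducible.
        self.target = random.Random(seed).randint(low, high)
        self.turns = 0

    def prompt(self):
        return f"Guess an integer between {self.low} and {self.high}."

    def step(self, action):
        """Apply one move and return (feedback, done, score)."""
        self.turns += 1
        try:
            guess = int(action.strip())
        except ValueError:
            guess = None  # unparseable reply counts as a wasted turn
        if guess == self.target:
            return "correct", True, 1.0
        if self.turns >= self.max_turns:
            return "out of turns", True, 0.0
        if guess is None:
            return "invalid move", False, 0.0
        return ("higher" if guess < self.target else "lower"), False, 0.0


def binary_search_policy(low, high):
    """Scripted stand-in for a model: narrows the range using feedback."""
    state = {"low": low, "high": high, "last": None}

    def act(observation):
        if observation == "higher":
            state["low"] = state["last"] + 1
        elif observation == "lower":
            state["high"] = state["last"] - 1
        state["last"] = (state["low"] + state["high"]) // 2
        return str(state["last"])

    return act


def play_episode(game, policy):
    """Run one multi-turn episode and return its final score."""
    observation, done, score = game.prompt(), False, 0.0
    while not done:
        action = policy(observation)
        observation, done, score = game.step(action)
    return score


def evaluate(make_policy, seeds):
    """Average score over freshly generated game instances, one per seed."""
    scores = [play_episode(GuessNumberGame(s), make_policy()) for s in seeds]
    return sum(scores) / len(scores)
```

Calling `evaluate(lambda: binary_search_policy(1, 100), range(20))` averages the scripted policy over twenty seeded instances. In a real setup the policy would wrap a model call and parse its reply into a move. Generating instances from seeds is one simple way to keep a benchmark dynamic while remaining reproducible.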

Code & Models

Repositories

multimodal-art-projection/korgym (Official)

Videos

KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation · SlidesLive

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)