TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games
Prakamya Mishra, Jiang Liu, Jialian Wu, Xiaodong Yu, Zicheng Liu, Emad Barsoum

TL;DR
This paper introduces TTT-Bench, a new benchmark using simple Tic-Tac-Toe-style games to evaluate reasoning abilities of large reasoning models, revealing they often struggle with basic strategic reasoning despite excelling at complex math problems.
Contribution
The paper presents TTT-Bench, a scalable, verifiable benchmark for assessing basic reasoning skills in LRMs through simple yet challenging Tic-Tac-Toe-style games, highlighting gaps in current models.
Findings
Models excel at complex math but fail at simple reasoning tasks.
Model performance on TTT-Bench is 41% lower than on math benchmarks.
Larger models achieve higher performance while using shorter reasoning traces.
Abstract
Large reasoning models (LRMs) have demonstrated impressive reasoning capabilities across a broad range of tasks including Olympiad-level mathematical problems, indicating evidence of their complex reasoning abilities. While many reasoning benchmarks focus on the STEM domain, the ability of LRMs to reason correctly in broader task domains remains underexplored. In this work, we introduce TTT-Bench, a new benchmark that is designed to evaluate basic strategic, spatial, and logical reasoning abilities in LRMs through a suite of four two-player Tic-Tac-Toe-style games that humans can effortlessly solve from a young age. We propose a simple yet scalable programmatic approach for generating verifiable two-player game problems for TTT-Bench. Although these games are trivial for humans, they require reasoning about the intentions of the opponent, as well as the game board's spatial…
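To make the idea of programmatically generated, verifiable game problems concrete, here is a minimal sketch. It assumes ordinary 3x3 Tic-Tac-Toe, not the four TTT-Bench games, whose rules are not given in this excerpt. It also assumes a simple notion of a correct answer: win immediately if possible, otherwise block the opponent's immediate win. The names `generate_problem`, `best_moves`, and `winning_moves` are hypothetical and do not come from the paper.

```python
import random

# All eight winning lines on a 3x3 board, cells indexed 0..8 row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that player has completed a line, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


def winning_moves(board, player):
    """Empty cells where `player` would complete a line immediately."""
    moves = []
    for i, cell in enumerate(board):
        if cell == ".":
            trial = board[:i] + [player] + board[i + 1:]
            if winner(trial) == player:
                moves.append(i)
    return moves


def best_moves(board, player):
    """Assumed answer rule: win now if possible, else block the opponent."""
    opponent = "O" if player == "X" else "X"
    return winning_moves(board, player) or winning_moves(board, opponent)


def generate_problem(rng, plies=4):
    """Play `plies` random legal moves, then return (board, player, answer).

    Returns None when the game ended early or the position has no
    immediate win or block, so callers can simply resample.
    """
    board = ["."] * 9
    player = "X"
    for _ in range(plies):
        empty = [i for i, c in enumerate(board) if c == "."]
        board[rng.choice(empty)] = player
        if winner(board):
            return None
        player = "O" if player == "X" else "X"
    answer = best_moves(board, player)
    return (board, player, answer) if answer else None
```

Because the answer set is computed by exhaustive checking, a model's proposed move can be verified by simple membership, for example `move in answer`. Calling `generate_problem(random.Random(0))` repeatedly and discarding `None` results yields as many positions as needed.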
Taxonomy
Topics: Artificial Intelligence in Games · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
Methods: Focus · Sparse Evolutionary Training
