TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games

Prakamya Mishra; Jiang Liu; Jialian Wu; Xiaodong Yu; Zicheng Liu; Emad Barsoum

arXiv:2506.10209·cs.CL·June 13, 2025

TL;DR

This paper introduces TTT-Bench, a new benchmark using simple Tic-Tac-Toe-style games to evaluate reasoning abilities of large reasoning models, revealing they often struggle with basic strategic reasoning despite excelling at complex math problems.

Contribution

The paper presents TTT-Bench, a scalable, verifiable benchmark for assessing basic reasoning skills in large reasoning models (LRMs) through simple yet challenging Tic-Tac-Toe-style games, and uses it to highlight gaps in current models' reasoning.

Findings

01 Models that excel at complex math often fail at simple reasoning tasks.

02 Performance drops by 41% compared to math benchmarks.

03 Larger models achieve higher performance while using shorter reasoning traces.

Abstract

Large reasoning models (LRMs) have demonstrated impressive reasoning capabilities across a broad range of tasks including Olympiad-level mathematical problems, indicating evidence of their complex reasoning abilities. While many reasoning benchmarks focus on the STEM domain, the ability of LRMs to reason correctly in broader task domains remains underexplored. In this work, we introduce TTT-Bench, a new benchmark that is designed to evaluate basic strategic, spatial, and logical reasoning abilities in LRMs through a suite of four two-player Tic-Tac-Toe-style games that humans can effortlessly solve from a young age. We propose a simple yet scalable programmatic approach for generating verifiable two-player game problems for TTT-Bench. Although these games are trivial for humans, they require reasoning about the intentions of the opponent, as well as the game board's spatial…
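To make the idea of programmatically generated, verifiable game problems concrete, here is a minimal sketch. It is not the paper's generator: it assumes the classic 3x3 Tic-Tac-Toe rules rather than the four TTT-Bench games, which are not specified in this listing, and it assumes a problem counts as verifiable when the player to move either has an immediate winning move or must block a single opposing threat. All function names (winner, winning_moves, solution_moves, generate_problem, verify) are hypothetical.

```python
import random

# Assumption: classic 3x3 Tic-Tac-Toe, cells indexed 0..8 row by row.
# The eight lines that win the game.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that player owns a full line, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


def winning_moves(board, player):
    """List the empty cells where `player` would win immediately."""
    moves = []
    for i in range(9):
        if board[i] == ".":
            nxt = board[:i] + player + board[i + 1:]
            if winner(nxt) == player:
                moves.append(i)
    return moves


def solution_moves(board, player):
    """Answer set: win now if possible, otherwise block a single threat."""
    opponent = "O" if player == "X" else "X"
    wins = winning_moves(board, player)
    if wins:
        return wins
    threats = winning_moves(board, opponent)
    return threats if len(threats) == 1 else []


def generate_problem(rng, max_tries=10_000):
    """Sample random positions (X to move) until one has a checkable answer."""
    for _ in range(max_tries):
        n_plies = rng.choice([2, 4, 6])          # equal stones, so X moves next
        cells = rng.sample(range(9), n_plies)
        grid = ["."] * 9
        for k, cell in enumerate(cells):
            grid[cell] = "X" if k % 2 == 0 else "O"
        board = "".join(grid)
        if winner(board) is not None:            # game already over, skip
            continue
        answers = solution_moves(board, "X")
        if answers:
            return board, "X", answers
    raise RuntimeError("no suitable position found")


def verify(board, player, move):
    """Check a proposed move against the computed answer set."""
    return move in solution_moves(board, player)


if __name__ == "__main__":
    board, player, answers = generate_problem(random.Random(0))
    print(board[0:3], board[3:6], board[6:9], sep="\n")
    print(f"{player} to move; accepted cells: {answers}")
```

Because the answer set is computed by the same rules that define the game, a model's reply can be scored by a direct membership check, with no human grading; changing the sampler or the line definitions would produce other game variants under the same scheme.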

Code & Models

Datasets

amd/TTT-Bench (dataset, 98 downloads)

Videos

TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games

Taxonomy

Topics: Artificial Intelligence in Games · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications

Methods: Focus · Sparse Evolutionary Training