QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi

Anthony G. Cohn; Robert E. Blackwell

arXiv:2605.18380·cs.AI·May 19, 2026

TL;DR

This paper introduces QSTRBench, a comprehensive benchmark for evaluating large language models' reasoning abilities in qualitative spatial and temporal calculi, highlighting current performance gaps.

Contribution

The paper presents the first extensive QSTR benchmark to include RCC-22, systematically varies question formats, and evaluates the reasoning capabilities of contemporary models.

Findings

01. All models tested perform better than guessing, but none consistently answers every question correctly.

02. Performance varies significantly across different calculi.

03. RCC-22 is the most challenging calculus for current models.

Abstract

We introduce an extensive qualitative spatial and temporal reasoning (QSTR) benchmark for evaluating large language models (LLMs). We pose questions concerning compositional reasoning (using composition tables, CT), converse relations, and conceptual neighbourhoods (CN) for the following QSTR calculi: Point Algebra (PA), Allen's Interval Algebra, Interval and Duration (INDU), Region Connection Calculus (RCC-5, RCC-8, and RCC-22), the nine intersection model, cardinal direction calculus, and STAR. The RCC-22 CN is published here for the first time. An extended benchmark systematically varies question presentation, including prefix/infix notation, words/symbols/nonce terms, and schematic descriptions for selected calculi. We report results for contemporary frontier models. All models tested perform better than guessing, but none can consistently answer all questions correctly. Performance varies sharply by…
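To make the three question types concrete, here is a minimal sketch for Point Algebra, the smallest calculus listed in the abstract. It encodes the standard PA converse map, composition table, and conceptual neighbourhood, then checks the composition table against small integer models. The function names and the prefix/infix templates are hypothetical and purely illustrative; they are not taken from the benchmark, whose actual prompt wording is not given here.

```python
from itertools import product

# Point Algebra base relations between two points.
BASE = ("<", "=", ">")

# Converse: if x R y holds, then y CONVERSE[R] x holds.
CONVERSE = {"<": ">", "=": "=", ">": "<"}

# Composition table: given x R1 y and y R2 z, the set of
# base relations that can hold between x and z.
COMPOSE = {
    ("<", "<"): {"<"},
    ("<", "="): {"<"},
    ("<", ">"): {"<", "=", ">"},
    ("=", "<"): {"<"},
    ("=", "="): {"="},
    ("=", ">"): {">"},
    (">", "<"): {"<", "=", ">"},
    (">", "="): {">"},
    (">", ">"): {">"},
}

# Conceptual neighbourhood: relations reachable from one another by a
# continuous change with no third relation holding in between.
NEIGHBOURS = {"<": {"="}, "=": {"<", ">"}, ">": {"="}}


def converse(r):
    return CONVERSE[r]


def compose(r1, r2):
    return frozenset(COMPOSE[(r1, r2)])


def table_matches_integer_models(domain=range(3)):
    """Rebuild the composition table from concrete integers and compare."""
    def rel(a, b):
        return "<" if a < b else "=" if a == b else ">"

    derived = {}
    for x, y, z in product(domain, repeat=3):
        derived.setdefault((rel(x, y), rel(y, z)), set()).add(rel(x, z))
    return derived == COMPOSE


def format_fact(r, a, b, style="infix"):
    """Hypothetical templates illustrating prefix versus infix presentation."""
    return f"{a} {r} {b}" if style == "infix" else f"{r}({a}, {b})"


if __name__ == "__main__":
    print(sorted(compose("<", ">")))      # every base relation is possible
    print(converse("<"))
    print(table_matches_integer_models())
    print(format_fact("<", "x", "y", style="prefix"))
```

A composition question asks for `compose(r1, r2)`, a converse question for `converse(r)`, and a conceptual-neighbourhood question for `NEIGHBOURS[r]`. The larger calculi in the benchmark follow the same pattern with bigger tables.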
