QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi
Anthony G. Cohn, Robert E. Blackwell

TL;DR
This paper introduces QSTRBench, a comprehensive benchmark for evaluating large language models' reasoning abilities in qualitative spatial and temporal calculi, highlighting current performance gaps.
Contribution
It presents the first extensive benchmark to include RCC-22, systematically varies how questions are presented, and evaluates contemporary models' reasoning capabilities in QSTR.
Findings
All models tested perform better than guessing, but none consistently answers every question correctly.
Performance varies significantly across different calculi.
RCC-22 is the most challenging calculus for current models.
Abstract
We introduce an extensive qualitative spatial and temporal reasoning (QSTR) benchmark for evaluating large language models (LLMs). We pose questions concerning compositional reasoning (using composition tables, CT), converse relations, and conceptual neighbourhoods (CN) for the following QSTR calculi: Point Algebra (PA), Allen's Interval Algebra, Interval and Duration (INDU), Region Connection Calculus (RCC-5, RCC-8, and RCC-22), the nine intersection model, cardinal direction calculus, and STAR. The RCC-22 CN is published here for the first time. An extended benchmark systematically varies question presentation including prefix/infix, words/symbols/nonce terms and schematic descriptions for selected calculi. We report results for contemporary frontier models. All models tested perform better than guessing but none can consistently answer all questions correctly. Performance varies sharply by…
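To make the three question types concrete, the following is a minimal sketch for the simplest calculus named in the abstract, the Point Algebra, whose base relations are `<`, `=` and `>`. It is not the benchmark's code or data format: the names `compose`, `COMPOSITION_TABLE`, `CONVERSE`, `NEIGHBOURS`, `infix_question` and `prefix_question` are hypothetical, and the question wording is invented for illustration only.

```python
from itertools import product

# Base relations of the Point Algebra (points on a linear order).
BASE = ("<", "=", ">")

# Converse: if x r y holds, then y CONVERSE[r] x holds.
CONVERSE = {"<": ">", "=": "=", ">": "<"}

# Conceptual neighbourhood: relations reachable from one another by a
# continuous change with no other relation holding in between.
NEIGHBOURS = {"<": {"="}, "=": {"<", ">"}, ">": {"="}}


def compose(r1: str, r2: str) -> set[str]:
    """Return every relation possible between x and z, given x r1 y and y r2 z."""
    if r1 == "=":
        return {r2}
    if r2 == "=":
        return {r1}
    if r1 == r2:
        return {r1}
    # One of "<" then ">" or ">" then "<": nothing can be ruled out.
    return set(BASE)


# The full composition table (CT) as a dictionary keyed by relation pairs.
COMPOSITION_TABLE = {(r1, r2): compose(r1, r2) for r1, r2 in product(BASE, BASE)}


def infix_question(r1: str, r2: str) -> str:
    """Hypothetical infix phrasing of a composition question."""
    return f"If x {r1} y and y {r2} z, which relations can hold between x and z?"


def prefix_question(r1: str, r2: str) -> str:
    """Hypothetical prefix phrasing of the same question."""
    return f"If {r1}(x, y) and {r2}(y, z), which relations can hold between x and z?"


if __name__ == "__main__":
    print(infix_question("<", ">"))
    print(sorted(COMPOSITION_TABLE[("<", ">")]))
```

A model's answer to a composition question can then be scored by comparing the set of relations it names against the corresponding table entry. The larger calculi in the benchmark follow the same pattern with more base relations and correspondingly larger tables.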
