Mechanics of Learned Reasoning 1: TempoBench, A Benchmark for Interpretable Deconstruction of Reasoning System Performance

Nikolaus Holzer; William Fishell; Baishakhi Ray; Mark Santolucito

arXiv:2510.27544·cs.AI·November 3, 2025

Open Access

TL;DR

TempoBench is a novel, formally grounded benchmark designed to systematically evaluate large language models' reasoning abilities, especially in complex, multi-step causal and temporal tasks, revealing current limitations in handling increased complexity.

Contribution

The paper introduces TempoBench, the first verifiable diagnostic benchmark that parametrizes difficulty to analyze LLM reasoning performance systematically.

Findings

01 Models score 65.6% on TCE-normal

02 Models score 7.5% on TCE-hard

03 Performance drops significantly with increased complexity

Abstract

Large Language Models (LLMs) are increasingly excelling and outpacing human performance on many tasks. However, to improve LLM reasoning, researchers either rely on ad-hoc generated datasets or formal mathematical proof systems such as the Lean proof assistant. Whilst ad-hoc generated methods can capture the decision chains of real-world reasoning processes, they may encode some inadvertent bias in the space of reasoning they cover; they also cannot be formally verified. On the other hand, systems like Lean can guarantee verifiability, but are not well-suited to capture the nature of agentic decision chain-based tasks. This creates a gap both in performance for functions such as business agents or code assistants, and in the usefulness of LLM reasoning benchmarks, whereby these fall short in reasoning structure or real-world alignment. We introduce TempoBench, the first formally…

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Multimodal Machine Learning Applications