TEMPO: A Realistic Multi-Domain Benchmark for Temporal Reasoning-Intensive Retrieval

Abdelrahman Abdallah; Mohammed Ali; Muhammad Abdul-Mageed; Adam Jatowt

arXiv:2601.09523 · cs.IR · January 15, 2026

Open Access

TL;DR

TEMPO is a new benchmark designed to evaluate complex temporal reasoning in retrieval systems across multiple domains, highlighting current challenges and guiding future improvements.

Contribution

TEMPO is the first multi-domain benchmark to combine deep temporal reasoning with reasoning-intensive retrieval. It adds new temporal metrics and step-wise evaluation, with gold documents mapped to each retrieval step.

Findings

01

The best model achieves only 32.0 NDCG@10, indicating that the benchmark is difficult.

02

Substantial challenges remain in retrieving temporally complete evidence.

03

Benchmark covers 13 domains with complex, multi-step queries.
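
For readers unfamiliar with the headline metric, the following is a minimal sketch of NDCG@k in its standard linear-gain form. The function names, the dictionary of gold relevance labels, and the document IDs are illustrative assumptions, not taken from the paper, and the 32.0 figure above is presumably this value reported on a 0 to 100 scale.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_ids, gold, k=10):
    """NDCG@k for one query.

    ranked_ids: document IDs in the order the retriever returned them.
    gold: hypothetical mapping from document ID to relevance grade (0 if absent).
    """
    gains = [gold.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(gold.values(), reverse=True)  # best possible ordering
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Illustrative data only: one of two gold documents is retrieved, at rank 2.
example_gold = {"d1": 1, "d2": 1}
example_score = ndcg_at_k(["d3", "d1", "d9"], example_gold, k=10)
```

A score well below 1.0 here reflects both the missing gold document and the rank penalty on the one that was found, which is the kind of shortfall a low benchmark-wide NDCG@10 summarizes.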

Abstract

Existing temporal QA benchmarks focus on simple fact-seeking queries from news corpora, while reasoning-intensive retrieval benchmarks lack temporal grounding. However, real-world information needs often require reasoning about temporal evolution and synthesizing evidence across time periods. We introduce TEMPO, the first benchmark combining temporal reasoning with reasoning-intensive retrieval across 13 domains. TEMPO features: (1) 1,730 complex queries requiring deep temporal reasoning such as tracking changes, identifying trends, or comparing cross-period evidence; (2) step-wise retrieval planning with 3,976 decomposed steps and gold documents mapped to each step for multi-hop evaluation; and (3) novel temporal metrics including Temporal Coverage@k and Temporal Precision@k measuring whether results span required time periods. Evaluation of 12 retrieval systems reveals substantial…
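
The abstract describes the temporal metrics only as measuring whether results span the required time periods, so the sketch below is one plausible reading and not the paper's definition. It assumes each document carries a single period label and each query lists its required periods; the function names, the `doc_period` mapping, and the sample data are all hypothetical.

```python
def temporal_coverage_at_k(ranked_ids, doc_period, required_periods, k=10):
    """Assumed definition: share of required periods hit by at least one top-k document."""
    required = set(required_periods)
    if not required:
        return 0.0
    covered = {doc_period[d] for d in ranked_ids[:k] if d in doc_period} & required
    return len(covered) / len(required)


def temporal_precision_at_k(ranked_ids, doc_period, required_periods, k=10):
    """Assumed definition: share of top-k documents whose period is a required one."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    required = set(required_periods)
    hits = sum(1 for d in top if doc_period.get(d) in required)
    return hits / len(top)


# Illustrative data only: the query needs evidence from three years.
periods = {"a": "2019", "b": "2019", "c": "2021", "d": "2015"}
ranking = ["a", "b", "c", "d"]
needed = ["2019", "2020", "2021"]

coverage = temporal_coverage_at_k(ranking, periods, needed, k=4)
precision = temporal_precision_at_k(ranking, periods, needed, k=4)
```

Under these assumptions the two numbers can diverge: a ranking may contain mostly on-period documents (high precision) while still missing an entire required period (low coverage), which is the failure mode the benchmark's findings point to.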

Taxonomy

Topics: Information Retrieval and Search Behavior · Topic Modeling · Multimodal Machine Learning Applications