TEMPO: A Realistic Multi-Domain Benchmark for Temporal Reasoning-Intensive Retrieval
Abdelrahman Abdallah, Mohammed Ali, Muhammad Abdul-Mageed, Adam Jatowt

TL;DR
TEMPO is a benchmark for evaluating how well retrieval systems handle queries that require complex temporal reasoning. It spans 13 domains and shows where current retrievers fall short.
Contribution
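TEMPO is the first multi-domain benchmark to combine temporal reasoning with reasoning-intensive retrieval. It contributes two temporal metrics (Temporal Coverage@k and Temporal Precision@k) and step-wise retrieval plans with gold documents mapped to each step.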
Findings
The best of the 12 evaluated retrieval systems reaches only 32.0 NDCG@10, which indicates that the benchmark is difficult.
Substantial challenges remain in retrieving temporally complete evidence.
The benchmark covers 13 domains with 1,730 complex queries, decomposed into 3,976 retrieval steps.
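
For readers unfamiliar with the headline metric, the following is a minimal sketch of NDCG@k using the standard linear-gain formulation. The function names are illustrative, and the summary does not say which gain variant or scale the authors use. Scores are often reported multiplied by 100, in which case 32.0 would correspond to 0.320 here, but that is an assumption.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance grades.

    Uses linear gain (rel / log2(rank + 1)) with 1-based ranks; this is an
    assumption, since the summary does not state the variant used.
    """
    return sum(
        rel / math.log2(position + 2)  # position is 0-based, so rank + 1 = position + 2
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """NDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_relevances: relevance grades of the retrieved documents, in rank order.
    all_relevances: relevance grades of every judged document for the query.
    """
    ideal = sorted(all_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents exist for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg
```

A ranking that places the only relevant document first scores 1.0, and pushing it to rank 2 lowers the score to 1 / log2(3).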
Abstract
Existing temporal QA benchmarks focus on simple fact-seeking queries from news corpora, while reasoning-intensive retrieval benchmarks lack temporal grounding. However, real-world information needs often require reasoning about temporal evolution and synthesizing evidence across time periods. We introduce TEMPO, the first benchmark combining temporal reasoning with reasoning-intensive retrieval across 13 domains. TEMPO features: (1) 1,730 complex queries requiring deep temporal reasoning such as tracking changes, identifying trends, or comparing cross-period evidence; (2) step-wise retrieval planning with 3,976 decomposed steps and gold documents mapped to each step for multi-hop evaluation; and (3) novel temporal metrics including Temporal Coverage@k and Temporal Precision@k measuring whether results span required time periods. Evaluation of 12 retrieval systems reveals substantial…
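
The abstract describes Temporal Coverage@k and Temporal Precision@k only as measuring whether results span the required time periods; it does not give formulas. The sketch below is one plausible reading, not the paper's definition. It assumes each document and each query is annotated with a set of period labels, that coverage is the fraction of required periods hit by the top-k results, and that precision is the fraction of top-k results touching at least one required period. All names are hypothetical.

```python
def temporal_coverage_at_k(ranked_doc_periods, required_periods, k=10):
    """Fraction of the query's required periods covered by the top-k documents.

    ranked_doc_periods: list of sets of period labels, one per retrieved
    document, in rank order (an assumed annotation format).
    """
    required = set(required_periods)
    if not required:
        return 0.0
    covered = set()
    for periods in ranked_doc_periods[:k]:
        covered |= required & set(periods)
    return len(covered) / len(required)


def temporal_precision_at_k(ranked_doc_periods, required_periods, k=10):
    """Fraction of the top-k documents that touch at least one required period."""
    required = set(required_periods)
    top_k = ranked_doc_periods[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for periods in top_k if required & set(periods))
    return hits / len(top_k)


# Hypothetical query needing evidence from three decades.
required = {"2000s", "2010s", "2020s"}
ranked = [{"2010s"}, {"1990s"}, {"2020s"}]
coverage = temporal_coverage_at_k(ranked, required, k=3)
precision = temporal_precision_at_k(ranked, required, k=3)
```

Under these assumptions, a system can score well on precision while missing a required period entirely, which is the kind of temporally incomplete evidence the findings point to.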
Taxonomy
Topics: Information Retrieval and Search Behavior · Topic Modeling · Multimodal Machine Learning Applications
