TS-Haystack: A Multi-Task Retrieval Benchmark for Long-Context Time-Series Reasoning

Nicolas Zumarraga; Thomas Kaar; Ning Wang; William Tennien; Alpay Hasanli; Max Rosenblattl; Fan Wu; Kevin Riehl; Maxwell A. Xu; Markus Kreft; Kevin O'Sullivan; Elgar Fleisch; Paul Schmiedmayer; Robert Jakob; Patrick Langer

arXiv:2602.14200·cs.LG·May 14, 2026

TL;DR

This paper introduces TS-Haystack, a multi-domain retrieval benchmark that evaluates long-context reasoning in Time Series Language Models (TSLMs) with ten event-grounded question-answering tasks over contexts from 100 seconds to 24 hours.

Contribution

The paper presents TS-Haystack, a new multi-domain retrieval benchmark, and shows that an agentic retrieval framework using specialized time-series classifier tools outperforms existing models on most tasks.

Findings

01

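Existing TSLMs degrade severely on long contexts: accuracy declines as context length grows, and time-interval-grounded tasks collapse toward near-zero accuracy.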

02

Direct-tokenization models run out of memory beyond 100 seconds on high-rate signals.

03

Agentic retrieval with specialized tools outperforms state-of-the-art models on most tasks.
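
The memory finding above can be made concrete with a rough back-of-the-envelope sketch of how sequence length grows with context duration. The 50 Hz sampling rate and the 16-samples-per-token patch size are illustrative assumptions, not values reported by the paper, and the helper names are hypothetical.

```python
import math


def samples_in_context(duration_s: float, sampling_rate_hz: float) -> int:
    """Number of raw samples in a context of the given duration."""
    return int(round(duration_s * sampling_rate_hz))


def tokens_for_context(duration_s: float,
                       sampling_rate_hz: float,
                       samples_per_token: int = 1) -> int:
    """Approximate token count when every sample (or patch of samples)
    becomes one token. samples_per_token is a hypothetical patch size."""
    n_samples = samples_in_context(duration_s, sampling_rate_hz)
    return math.ceil(n_samples / samples_per_token)


if __name__ == "__main__":
    rate_hz = 50.0  # assumed sampling rate, for illustration only
    for label, duration_s in [("100 s", 100), ("1 h", 3600), ("24 h", 86400)]:
        per_sample = tokens_for_context(duration_s, rate_hz)
        patched = tokens_for_context(duration_s, rate_hz, samples_per_token=16)
        print(f"{label:>5}: {per_sample:>9,d} tokens (1 sample/token), "
              f"{patched:>8,d} tokens (16 samples/token)")
```

Under these assumptions, a 24-hour context holds several million samples, so even with patching the sequence is far longer than a 100-second one. This is only meant to show the scale of the problem, not to reproduce any model's actual tokenizer.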

Abstract

Time Series Language Models (TSLMs) promise reasoning over real-world temporal data, but their ability to retrieve and reason over long time series remains largely untested. We introduce TS-Haystack, a multi-domain retrieval benchmark with ten event-grounded question-answering tasks over contexts from 100 seconds to 24 hours, spanning direct retrieval, temporal reasoning, multi-step reasoning, and contextual anomaly detection. Existing TSLMs exhibit severe long-context degradation: accuracy declines with context length, direct-tokenization models run out of memory beyond 100 seconds on high-rate signals, and time-interval-grounded tasks collapse toward near-zero accuracy as time-series length increases, consistent with existing literature on text and multi-modal long-context retrieval. An agentic retrieval framework using specialized time-series classifier tools matches or…
