MMTS-BENCH: A Comprehensive Benchmark for Time Series Understanding and Reasoning

Yao Yin; Zhenyu Xiao; Musheng Li; Yiwen Liu; Sutong Nan; Yiting He; Ruiqi Wang; Zhenwei Zhang; Qingmin Liao; Yuantao Gu

arXiv:2602.08588·cs.DB·February 10, 2026

TL;DR

MMTS-BENCH is a comprehensive benchmark designed to evaluate large language models on diverse time series understanding and reasoning tasks, highlighting current limitations and guiding future improvements.

Contribution

Introduces MMTS-BENCH, a hierarchical, multimodal benchmark with 2,424 QA pairs for assessing LLMs on time series tasks, and provides extensive evaluation insights.

Findings

01 TS-LLMs lag behind general-purpose LLMs in cross-domain tasks.

02 LLMs perform worse on local than global time series tasks.

03 Chain-of-thought reasoning and multimodal integration improve performance.

Abstract

Time series data are central to domains such as finance, healthcare, and cloud computing, yet existing benchmarks for evaluating large language models (LLMs) on temporal tasks remain scattered and unsystematic. To bridge this gap, we introduce MMTS-BENCH, a comprehensive multimodal benchmark built upon a hierarchical taxonomy of time-series tasks, spanning structural awareness, feature analysis, temporal reasoning, sequence matching, and cross-modal alignment. MMTS-BENCH comprises 2,424 time series question answering (TSQA) pairs across 4 subsets: Base, InWild, Match, and Align, generated through a progressive real-world QA framework and modular synthetic data construction. We conduct extensive evaluations on closed-source and open-source LLMs as well as existing time-series-adapted large language models (TS-LLMs), revealing that: (1) TS-LLMs significantly lag behind general-purpose LLMs…

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Time Series Analysis and Forecasting