TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models

Fangxu Yu; Xingang Guo; Lingzhi Yuan; Haoqiang Kang; Hongyu Zhao; Lianhui Qin; Furong Huang; Bin Hu; Tianyi Zhou

arXiv:2601.18744 · cs.AI · May 11, 2026



TL;DR

TSRBench is a comprehensive benchmark for evaluating generalist models' reasoning over multi-modal time series data across diverse domains and tasks, revealing current limitations and guiding future improvements.

Contribution

Introduces TSRBench, a large-scale, multi-modal time series reasoning benchmark with diverse tasks and insights into model capabilities and limitations.

Findings

01. Scaling laws hold for perception and reasoning but not for prediction.

02. Strong reasoning does not ensure accurate context-aware forecasting.

03. Current multimodal models struggle to effectively fuse textual and visual time series inputs.

Abstract

Time series are ubiquitous in real-world scenarios and crucial for applications ranging from energy management to traffic control. Consequently, the ability to reason over time series is a fundamental skill for generalist models to solve complex problems. However, current benchmarks for generalist models largely overlook this dimension. To bridge this gap, we introduce TSRBench, a comprehensive multi-modal benchmark designed to stress-test the full spectrum of time series reasoning capabilities. TSRBench features: i) a diverse set of 4125 problems from 14 domains, categorized into 4 major dimensions: Perception, Reasoning, Prediction, and Decision-Making; and ii) 15 tasks across these 4 dimensions that evaluate essential reasoning capabilities (e.g., numerical reasoning). Through extensive experiments, we evaluate over 30 leading proprietary and open-source LLMs, VLMs, and TSLLMs within…
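Because the abstract groups problems into four dimensions, results on a benchmark like this are naturally reported per dimension. The following is a minimal sketch of such a tally, not the authors' evaluation code. The record fields (`dimension`, `answer`, `prediction`), the function name `accuracy_by_dimension`, and the assumption that answers can be compared as short case-insensitive strings are all hypothetical; only the four dimension names come from the abstract.

```python
from collections import defaultdict

# The four dimension names come from the abstract; everything else is hypothetical.
DIMENSIONS = ("Perception", "Reasoning", "Prediction", "Decision-Making")


def accuracy_by_dimension(records):
    """Return {dimension: accuracy} for a list of scored records.

    Each record is assumed (hypothetically) to be a dict with the keys
    "dimension", "answer", and "prediction", where the last two are
    short strings compared case-insensitively.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        dim = rec["dimension"]
        if dim not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dim!r}")
        total[dim] += 1
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[dim] += 1
    # Only report dimensions that actually appear in the input.
    return {dim: correct[dim] / total[dim] for dim in DIMENSIONS if total[dim]}


# Toy, made-up records purely for illustration (not TSRBench data).
toy_records = [
    {"dimension": "Perception", "answer": "A", "prediction": "a"},
    {"dimension": "Perception", "answer": "B", "prediction": "C"},
    {"dimension": "Prediction", "answer": "D", "prediction": "D"},
]
scores = accuracy_by_dimension(toy_records)
```

Reporting per-dimension scores rather than a single average keeps a gap in one dimension (for example, Prediction) from being hidden by strength in another.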


Code & Models

Repositories

Datasets

umd-zhou-lab/TSRBench
dataset · 105 downloads
