Prompting Underestimates LLM Capability for Time Series Classification

Dan Schumacher; Erfan Nourbakhsh; Rocky Slavin; Anthony Rios

arXiv:2601.03464 · cs.CL · March 13, 2026

Open Access

TL;DR

Large language models encode substantial class-discriminative information about time series, but prompt-based evaluation underestimates it: linear probes over the same internal representations raise average F1 from 0.15-0.26 (zero-shot prompting) to 0.61-0.67.

Contribution

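The paper directly compares prompt outputs with linear probes over the same internal representations, showing that prompt-based assessment underestimates LLMs' time series understanding and that the representations carry more class information than the generated outputs reveal.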

Findings

01

Linear probes raise average F1 from 0.15-0.26 (zero-shot prompting) to 0.61-0.67

02

Class-discriminative time series information emerges in early transformer layers and is amplified by visual and multimodal inputs

03

Prompt-based evaluation underestimates LLMs' time series classification capability: zero-shot prompting performs near chance

Abstract

Prompt-based evaluations suggest that large language models (LLMs) perform poorly on time series classification, raising doubts about whether they encode meaningful temporal structure. We show that this conclusion reflects limitations of prompt-based generation rather than the model's representational capacity by directly comparing prompt outputs with linear probes over the same internal representations. While zero-shot prompting performs near chance, linear probes improve average F1 from 0.15-0.26 to 0.61-0.67, often matching or exceeding specialized time series models. Layer-wise analyses further show that class-discriminative time series information emerges in early transformer layers and is amplified by visual and multimodal inputs. Together, these results demonstrate a systematic mismatch between what LLMs internally represent and what prompt-based evaluation reveals, leading…
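To make the probing idea concrete, here is a minimal sketch of a linear probe: a softmax classifier trained on frozen feature vectors and scored with macro-averaged F1. It is an illustration under stated assumptions, not the authors' pipeline. The features are synthetic stand-ins for pooled hidden states from one layer, the random-guess baseline only stands in for "near chance" and is not a prompting result, and every name here (`fit_linear_probe`, `macro_f1`, `synthetic_hidden_states`) is hypothetical. Macro averaging is also an assumption; the abstract says only "average F1".

```python
import numpy as np

def fit_linear_probe(X, y, n_classes, lr=0.1, epochs=300, l2=1e-3):
    """Fit a softmax (multinomial logistic) probe on frozen features X."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ G + l2 * W)
        b -= lr * G.sum(axis=0)
    return W, b

def predict(X, W, b):
    return np.argmax(X @ W + b, axis=1)

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

def synthetic_hidden_states(n, n_classes, dim, signal, rng):
    """Hypothetical stand-in for one layer's pooled hidden states."""
    y = rng.integers(0, n_classes, size=n)
    centers = rng.normal(size=(n_classes, dim))
    X = signal * centers[y] + rng.normal(size=(n, dim))
    return X, y

rng = np.random.default_rng(0)
n_classes, dim = 4, 32
X, y = synthetic_hidden_states(600, n_classes, dim, signal=1.0, rng=rng)
X_train, y_train, X_test, y_test = X[:400], y[:400], X[400:], y[400:]

# Standardize with training statistics only; the "model" features stay frozen.
mu, sd = X_train.mean(axis=0), X_train.std(axis=0) + 1e-8
X_train, X_test = (X_train - mu) / sd, (X_test - mu) / sd

W, b = fit_linear_probe(X_train, y_train, n_classes)
probe_f1 = macro_f1(y_test, predict(X_test, W, b), n_classes)

# Random guessing as a chance-level reference point (not a prompting result).
random_guess = rng.integers(0, n_classes, size=len(y_test))
chance_f1 = macro_f1(y_test, random_guess, n_classes)

print(f"probe macro-F1:  {probe_f1:.2f}")
print(f"chance macro-F1: {chance_f1:.2f}")
```

In a real experiment, `X` would hold activations extracted from a chosen transformer layer for each time series input, and repeating the fit per layer would give a layer-wise view of where class information becomes linearly decodable.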

Taxonomy

Topics: Time Series Analysis and Forecasting · Language and cultural evolution · Topic Modeling