Byte Pair Encoding for Efficient Time Series Forecasting
Leon Götz, Marcel Kollovieh, Stephan Günnemann, Leo Schwinn

TL;DR
This paper introduces a pattern-centric tokenization method for time series that adaptively compresses data using motifs, significantly improving forecasting accuracy and efficiency while enabling effective post-hoc optimization.
Contribution
It proposes the first motif-based tokenization scheme for time series, enhancing model performance and efficiency without additional training overhead.
Findings
Improves forecasting performance by 36%.
Increases efficiency by 1990%.
Reduces MSE by up to 44% with conditional decoding.
Abstract
Existing time series tokenization methods predominantly encode a constant number of samples into individual tokens. This inflexible approach can generate excessive tokens for even simple patterns like extended constant values, resulting in substantial computational overhead. Inspired by the success of byte pair encoding, we propose the first pattern-centric tokenization scheme for time series analysis. Based on a discrete vocabulary of frequent motifs, our method merges samples with underlying patterns into tokens, compressing time series adaptively. Exploiting our finite set of motifs and the continuous properties of time series, we further introduce conditional decoding as a lightweight yet powerful post-hoc optimization method, which requires no gradient computation and adds no computational overhead. On recent time series foundation models, our motif-based tokenization improves…
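The merging idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the uniform quantization into `n_bins` symbols, the value range, and the helper names `quantize`, `learn_merges`, `encode`, and `decode` are all assumptions made for illustration. The sketch discretizes a series, then repeatedly merges the most frequent adjacent token pair into a new motif token, as byte pair encoding does for text.

```python
from collections import Counter

import numpy as np


def quantize(series, n_bins=8, lo=-2.0, hi=2.0):
    """Map real values to integer symbols 0..n_bins-1 (assumed uniform bins)."""
    inner_edges = np.linspace(lo, hi, n_bins + 1)[1:-1]
    return [int(s) for s in np.digitize(series, inner_edges)]


def merge_pair(tokens, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_merges(tokens, n_bins, n_merges):
    """Greedily merge the most frequent adjacent pair, BPE-style."""
    merges = []
    for step in range(n_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats any more
            break
        new_id = n_bins + step  # motif ids follow the base symbols
        merges.append((pair, new_id))
        tokens = merge_pair(tokens, pair, new_id)
    return merges, tokens


def encode(symbols, merges):
    """Apply the learned merges, in order, to a new symbol sequence."""
    tokens = list(symbols)
    for pair, new_id in merges:
        tokens = merge_pair(tokens, pair, new_id)
    return tokens


def decode(tokens, merges):
    """Expand motif tokens back into base symbols."""
    expand = {new_id: pair for pair, new_id in merges}
    stack, out = list(reversed(tokens)), []
    while stack:
        t = stack.pop()
        if t in expand:
            left, right = expand[t]
            stack.append(right)
            stack.append(left)
        else:
            out.append(t)
    return out


if __name__ == "__main__":
    # A long constant stretch followed by a repeated ramp.
    series = np.array([0.0] * 16 + [-1.0, 0.0, 1.0] * 4)
    symbols = quantize(series)
    merges, tokens = learn_merges(symbols, n_bins=8, n_merges=6)
    print(len(symbols), "samples ->", len(tokens), "tokens")
```

In this toy setup the constant stretch collapses into a handful of motif tokens while the ramp is captured by its own motifs, which is the adaptive compression the abstract describes. Decoding expands motifs back to base symbols, so the only information lost is the quantization itself.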
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The adaptation of BPE to time series tokenization is creative and directly tackles the inflexibility of existing methods. The study covers multiple model sizes, tokenizer configurations, and datasets, providing a broad empirical analysis. The examination of token embeddings reveals meaningful correlations with statistical properties, adding depth to the methodology.
The outperformance of patch-based models is not isolated to tokenization, as differences in architecture and training data confound the results. Compression ratios are used as a proxy for efficiency without reporting actual inference times or FLOPs, overlooking potential overheads from tokenization and detokenization. The benefits of adaptive tokenization are not rigorously tested against a range of fixed-patch lengths, leaving open whether adaptivity is truly superior. The paper lacks clear vis
1. The paper presents the first pattern-centric time series tokenization scheme. Drawing on byte pair encoding from natural language processing, it adaptively compresses repetitive patterns in time series into individual tokens. 2. The proposed method is computationally efficient and requires relatively little computation.
1. The experiments are highly insufficient: Table 2 includes only 5 datasets, with only one setting considered per dataset. Table 9 relies on outdated models and omits models from the past two years; for instance, PatchTST, which uses fixed-length patches, is not included. Additionally, the experimental results in Table 3 are unpersuasive; as the authors themselves note, comparisons across different model architectures and pre-training datasets fail to demonstrate that the pr
A. This paper is the first to systematically bring adaptive subword tokenization (BPE), which has been highly successful in NLP, into the time series domain, proposing the concept of "pattern tokens." This provides a novel and elegant perspective on representing variable-length patterns in time series. B. In addition to the tokenization framework, it introduces the ingenious optimization technique of "conditional decoding." This technique is computational
A. The paper emphasizes efficiency improvements at inference time but does not discuss the computational cost of the tokenizer itself (i.e., the pattern vocabulary construction stage). Although the paper describes this stage as "quite inexpensive," it gives no quantitative data on the time and memory required to build the vocabulary on large datasets. B. Conditional decoding is based on the first-order Markov assumption. However, the authors don't mention that for complex sequences with long-term d
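The first-order Markov assumption mentioned in the review can be made concrete with a small sketch. This is one plausible reading of conditional decoding, not the authors' code: it assumes one quantization token per sample, and the names `fit_conditional_decoder`, `decode_conditional`, and `centers` are hypothetical. For each (previous token, current token) pair, it stores the mean of the true values seen in training data. That mean is the closed-form minimizer of squared error for the pair, so no gradient computation is needed and decoding remains a table lookup.

```python
from collections import defaultdict

import numpy as np


def fit_conditional_decoder(tokens, values):
    """Mean true value for every observed (previous token, token) pair."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for prev, cur, v in zip(tokens, tokens[1:], values[1:]):
        sums[(prev, cur)] += float(v)
        counts[(prev, cur)] += 1
    return {pair: sums[pair] / counts[pair] for pair in sums}


def decode_plain(tokens, centers):
    """Baseline: decode every token to its bin center."""
    return np.array([centers[t] for t in tokens])


def decode_conditional(tokens, table, centers):
    """Decode each token given its predecessor; fall back to the bin center."""
    out = [centers[tokens[0]]]  # the first token has no predecessor
    for prev, cur in zip(tokens, tokens[1:]):
        out.append(table.get((prev, cur), centers[cur]))
    return np.array(out)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    values = np.sin(np.linspace(0, 20, 500)) + 0.05 * rng.standard_normal(500)

    # Assumed setup: 6 uniform bins of width 0.5 on [-1.5, 1.5].
    grid = np.linspace(-1.5, 1.5, 7)
    centers = grid[:-1] + 0.25
    tokens = [int(t) for t in np.digitize(values, grid[1:-1])]

    table = fit_conditional_decoder(tokens, values)
    mse_plain = np.mean((decode_plain(tokens, centers) - values) ** 2)
    mse_cond = np.mean((decode_conditional(tokens, table, centers) - values) ** 2)
    print(f"plain MSE: {mse_plain:.5f}  conditional MSE: {mse_cond:.5f}")
```

On the data used to fit the table, the conditional decoder cannot do worse than bin centers, because each stored mean is optimal for its pair. Whether the gain carries over to held-out data, and to sequences with longer dependencies than one step, is the concern the reviewer raises.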
Taxonomy
Topics: Time Series Analysis and Forecasting · Stock Market Forecasting Methods · Machine Learning in Healthcare
Methods: Sparse Evolutionary Training
