TimeDistill: Efficient Long-Term Time Series Forecasting with MLP via Cross-Architecture Distillation

Juntong Ni; Zewen Liu; Shiyu Wang; Ming Jin; Wei Jin

arXiv:2502.15016 · cs.LG · January 8, 2026

TL;DR

TimeDistill introduces a knowledge distillation framework that enables lightweight MLP models to efficiently perform long-term time series forecasting by capturing complex patterns typically modeled by more resource-intensive architectures.

Contribution

The paper presents a novel cross-architecture knowledge distillation method that transfers multi-scale and multi-period patterns from Transformers and CNNs to MLPs, so that the MLP gains accuracy while keeping its low computational and storage cost.
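To make "multi-scale and multi-period patterns" concrete, the following is a minimal sketch of how such distillation terms could be computed, not the paper's implementation. It assumes that multi-scale matching means comparing student and teacher forecasts after average-pooling at several resolutions, and that multi-period matching means comparing softmax-normalized FFT amplitude spectra with a KL divergence. The function names, the pooling factors, and the temperature are all hypothetical choices made for illustration.

```python
import numpy as np

def downsample(x, factor):
    """Average-pool the last (time) axis by `factor`, dropping any remainder."""
    usable = (x.shape[-1] // factor) * factor
    return x[..., :usable].reshape(*x.shape[:-1], -1, factor).mean(axis=-1)

def multi_scale_loss(student, teacher, factors=(1, 2, 4)):
    """MSE between student and teacher, averaged over several temporal resolutions.

    Assumption: the pooling factors are illustrative, not taken from the paper.
    """
    per_scale = [
        np.mean((downsample(student, f) - downsample(teacher, f)) ** 2)
        for f in factors
    ]
    return float(np.mean(per_scale))

def period_distribution(x, temperature=1.0):
    """Softmax over FFT amplitudes (DC term dropped): a distribution over frequencies."""
    amp = np.abs(np.fft.rfft(x, axis=-1))[..., 1:]
    z = amp / temperature
    z = z - z.max(axis=-1, keepdims=True)  # stabilize the exponent
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def multi_period_loss(student, teacher, temperature=1.0, eps=1e-12):
    """KL(teacher || student) between frequency distributions, averaged over series."""
    p = period_distribution(teacher, temperature)
    q = period_distribution(student, temperature)
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(np.mean(kl))

if __name__ == "__main__":
    t = np.arange(96) / 96.0
    teacher_pred = np.sin(2 * np.pi * 4 * t)   # hypothetical teacher forecast
    student_pred = np.sin(2 * np.pi * 8 * t)   # hypothetical student forecast
    print("multi-scale term: ", multi_scale_loss(student_pred, teacher_pred))
    print("multi-period term:", multi_period_loss(student_pred, teacher_pred))
```

In a training loop, terms like these would typically be added to the ordinary forecasting loss with tunable weights; how the paper weights or combines them is not stated in this summary.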

Findings

01. TimeDistill improves MLP performance by up to 18.6%, and the distilled MLP can surpass its teacher models.

02. It achieves up to 7X faster inference and 130X fewer parameters.

03. It is effective across eight diverse datasets.

Abstract

Transformer-based and CNN-based methods demonstrate strong performance in long-term time series forecasting. However, their high computational and storage requirements can hinder large-scale deployment. To address this limitation, we propose integrating lightweight MLP with advanced architectures using knowledge distillation (KD). Our preliminary study reveals different models can capture complementary patterns, particularly multi-scale and multi-period patterns in the temporal and frequency domains. Based on this observation, we introduce TimeDistill, a cross-architecture KD framework that transfers these patterns from teacher models (e.g., Transformers, CNNs) to MLP. Additionally, we provide a theoretical analysis, demonstrating that our KD approach can be interpreted as a specialized form of mixup data augmentation. TimeDistill improves MLP performance by up to 18.6%, surpassing…
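The abstract's remark that distillation can be read as a form of mixup has a simple analogue for squared-error losses, sketched below. This is an illustration under stated assumptions, not the paper's theoretical analysis: it assumes a loss that blends a supervised MSE term and a teacher-matching MSE term with a hypothetical weight `alpha`. For that loss, training against both targets equals training against a single mixed target `alpha * y + (1 - alpha) * teacher`, up to a constant that does not depend on the student.

```python
import numpy as np

def kd_loss(student, target, teacher, alpha):
    """Blend of supervised MSE and teacher-matching MSE (alpha is a hypothetical weight)."""
    supervised = np.mean((student - target) ** 2)
    distill = np.mean((student - teacher) ** 2)
    return alpha * supervised + (1 - alpha) * distill

def mixed_target_loss(student, target, teacher, alpha):
    """MSE against a mixup-style target that interpolates ground truth and teacher output."""
    mixed = alpha * target + (1 - alpha) * teacher
    return np.mean((student - mixed) ** 2)

def mixing_gap(target, teacher, alpha):
    """Constant separating the two losses; it does not depend on the student."""
    return alpha * (1 - alpha) * np.mean((target - teacher) ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.normal(size=(4, 24))        # ground-truth horizon (synthetic)
    teacher = rng.normal(size=(4, 24))  # teacher forecast (synthetic)
    student = rng.normal(size=(4, 24))  # student forecast (synthetic)
    alpha = 0.3
    lhs = kd_loss(student, y, teacher, alpha)
    rhs = mixed_target_loss(student, y, teacher, alpha) + mixing_gap(y, teacher, alpha)
    print(np.isclose(lhs, rhs))
```

Because the gap term is constant with respect to the student, both losses give the same gradients for the student's parameters, which is the sense in which the blended loss behaves like training on mixed labels.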

Taxonomy

Methods: Mixup · Knowledge Distillation