TimeDistill: Efficient Long-Term Time Series Forecasting with MLP via Cross-Architecture Distillation
Juntong Ni, Zewen Liu, Shiyu Wang, Ming Jin, Wei Jin

TL;DR
TimeDistill introduces a knowledge distillation framework that enables lightweight MLP models to efficiently perform long-term time series forecasting by capturing complex patterns typically modeled by more resource-intensive architectures.
Contribution
The paper presents a novel cross-architecture knowledge distillation method that transfers multi-scale and multi-period patterns from Transformers and CNNs to MLPs, improving efficiency and accuracy.
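As a rough illustration of what matching multi-scale and multi-period patterns could look like, the sketch below compares a student forecast with a teacher forecast at several temporal resolutions and in the frequency domain. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the average-pooling downsampler, the scale set (1, 2, 4), the softmax temperature, and the use of KL divergence over amplitude spectra are all illustrative choices that do not come from the source.

```python
import cmath
import math


def avg_pool(series, scale):
    """Downsample by averaging non-overlapping windows of length `scale`."""
    n = len(series) // scale
    return [sum(series[i * scale:(i + 1) * scale]) / scale for i in range(n)]


def multi_scale_loss(student, teacher, scales=(1, 2, 4)):
    """Average MSE between student and teacher forecasts over several resolutions.

    The scale set is an assumption chosen for illustration.
    """
    total = 0.0
    for s in scales:
        ps, pt = avg_pool(student, s), avg_pool(teacher, s)
        total += sum((a - b) ** 2 for a, b in zip(ps, pt)) / len(ps)
    return total / len(scales)


def amplitude_spectrum(series):
    """Amplitudes of DFT bins 1..n//2 (naive O(n^2) DFT, constant bin dropped)."""
    n = len(series)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(series)))
        for k in range(1, n // 2 + 1)
    ]


def softmax(values, temperature=1.0):
    """Numerically stable softmax that turns amplitudes into a distribution."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    z = sum(exps)
    return [e / z for e in exps]


def multi_period_loss(student, teacher, temperature=1.0):
    """KL(teacher || student) between softmax-normalized amplitude spectra.

    Treating the spectrum as a distribution over periods is an assumption here.
    """
    p = softmax(amplitude_spectrum(teacher), temperature)
    q = softmax(amplitude_spectrum(student), temperature)
    return sum(pi * math.log(pi / max(qi, 1e-12)) for pi, qi in zip(p, q) if pi > 0)


# Toy forecasts: the teacher has a clear period of 8 steps, the student is flat.
teacher_forecast = [math.sin(2 * math.pi * t / 8) for t in range(16)]
student_forecast = [0.0] * 16
print(multi_scale_loss(student_forecast, teacher_forecast))
print(multi_period_loss(student_forecast, teacher_forecast))
```

In a training loop, terms like these would typically be added to the ordinary forecasting loss with tunable weights. Those weights, like everything else in this sketch, are hypothetical.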
Findings
TimeDistill improves MLP performance by up to 18.6%, and the distilled MLP outperforms its teacher models.
Achieves up to 7X faster inference and requires up to 130X fewer parameters.
Effective across eight diverse datasets.
Abstract
Transformer-based and CNN-based methods demonstrate strong performance in long-term time series forecasting. However, their high computational and storage requirements can hinder large-scale deployment. To address this limitation, we propose integrating lightweight MLP with advanced architectures using knowledge distillation (KD). Our preliminary study reveals different models can capture complementary patterns, particularly multi-scale and multi-period patterns in the temporal and frequency domains. Based on this observation, we introduce TimeDistill, a cross-architecture KD framework that transfers these patterns from teacher models (e.g., Transformers, CNNs) to MLP. Additionally, we provide a theoretical analysis, demonstrating that our KD approach can be interpreted as a specialized form of mixup data augmentation. TimeDistill improves MLP performance by up to 18.6%, surpassing…
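One way to build intuition for the mixup interpretation mentioned in the abstract is a standard algebraic identity for squared error: a weighted sum of a ground-truth loss and a teacher-matching loss equals a single loss against a mixed target, plus a term that does not depend on the student. The sketch below checks this numerically. It is only an illustration under assumptions: the scalar setting, the weight `alpha`, and the function names are hypothetical, and the paper's own analysis may be formulated differently.

```python
def combined_loss(s, y, t, alpha):
    """Weighted sum of the supervised loss (vs. y) and the distillation loss (vs. t)."""
    return alpha * (s - y) ** 2 + (1 - alpha) * (s - t) ** 2


def mixed_target_loss(s, y, t, alpha):
    """Squared error against a mixup-style target, plus a student-independent constant."""
    mixed = alpha * y + (1 - alpha) * t            # interpolated target
    constant = alpha * (1 - alpha) * (y - t) ** 2  # does not involve s
    return (s - mixed) ** 2 + constant


# s: student prediction, y: ground truth, t: teacher prediction (toy scalars).
s, y, t, alpha = 0.3, 1.0, 0.6, 0.7
print(combined_loss(s, y, t, alpha), mixed_target_loss(s, y, t, alpha))
```

Because the constant term is independent of the student, minimizing either form gives the same student prediction, which is why distillation with squared error can be read as training on targets interpolated between the labels and the teacher's outputs.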
Taxonomy
Methods: Mixup · Knowledge Distillation
