SPAT: Sensitivity-based Multihead-attention Pruning on Time Series Forecasting Models

Suhan Guo; Jiahong Deng; Mengjun Yi; Furao Shen; Jian Zhao

arXiv:2505.08768·cs.LG·May 14, 2025

TL;DR

SPAT is a sensitivity-based structured pruning method that removes entire attention modules from time series forecasting models, reducing computational cost and speeding up inference without requiring specialized hardware.

Contribution

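The paper proposes SEND (Sensitivity Enhanced Normalized Dispersion), a dynamic sensitivity metric that measures the importance of each attention module during pre-training and is used to select which modules to prune. The resulting models are more efficient and outperform existing lightweight methods.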

Findings

01. Achieved a 2.842% reduction in MSE and a 1.996% reduction in MAE.

02. Reduced FLOPs by 35.274%.

03. Outperformed state-of-the-art methods in both standard and zero-shot inference.

Abstract

Attention-based architectures have achieved superior performance in multivariate time series forecasting but are computationally expensive. Techniques such as patching and adaptive masking have been developed to reduce their sizes and latencies. In this work, we propose a structured pruning method, SPAT (Sensitivity Pruner for Attention), which selectively removes redundant attention mechanisms and yields highly effective models. Different from previous approaches, SPAT aims to remove the entire attention module, which reduces the risk of overfitting and enables speed-up without demanding specialized hardware. We propose a dynamic sensitivity metric, Sensitivity Enhanced Normalized Dispersion (SEND), that measures the importance of each attention module during the pre-training phase. Experiments on multivariate…
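To make the idea of removing whole attention modules concrete, here is a minimal Python sketch. It is not the authors' implementation. The SEND formula is not given in this excerpt, so the per-module scores below are hypothetical placeholder numbers, and the names `Block`, `select_modules_to_prune`, and `prune` are invented for illustration. The sketch also assumes that a pruned block simply passes its input along the residual path, which the abstract does not specify.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]
Layer = Callable[[Vector], Vector]


def select_modules_to_prune(scores: Dict[str, float], num_to_prune: int) -> List[str]:
    """Return the names of the `num_to_prune` lowest-scoring modules."""
    if not 0 <= num_to_prune <= len(scores):
        raise ValueError("num_to_prune must be between 0 and len(scores)")
    # Sort ascending so the least important modules come first.
    ranked = sorted(scores, key=lambda name: scores[name])
    return ranked[:num_to_prune]


class Block:
    """Toy residual block: adds attention(x) to x, then adds feed_forward(x)."""

    def __init__(self, name: str, attention: Optional[Layer], feed_forward: Layer):
        self.name = name
        self.attention = attention  # None means the whole module was removed
        self.feed_forward = feed_forward

    def __call__(self, x: Vector) -> Vector:
        if self.attention is not None:
            x = [xi + ai for xi, ai in zip(x, self.attention(x))]
        # Assumption: with attention removed, x flows on unchanged to this step.
        return [xi + fi for xi, fi in zip(x, self.feed_forward(x))]


def prune(blocks: List[Block], scores: Dict[str, float], num_to_prune: int) -> List[str]:
    """Drop the entire attention module from the lowest-scoring blocks."""
    pruned = select_modules_to_prune(scores, num_to_prune)
    for block in blocks:
        if block.name in pruned:
            block.attention = None
    return pruned


if __name__ == "__main__":
    # Hypothetical importance scores; a real run would compute them with SEND.
    scores = {"block0": 0.91, "block1": 0.12, "block2": 0.47, "block3": 0.05}
    blocks = [
        Block(name, lambda x: [2.0 * v for v in x], lambda x: [0.0 for _ in x])
        for name in scores
    ]
    print(prune(blocks, scores, num_to_prune=2))
```

Because the module is removed outright rather than masked weight by weight, the pruned block skips the attention computation entirely, which is the kind of structural saving that needs no sparse-kernel support from the hardware.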

Taxonomy

Methods: Softmax · Attention Is All You Need · L1 Regularization · Activation Patching · Adaptive Masking · Masked autoencoder · Pruning