TL;DR
VarDrop reduces variate token redundancy in multivariate time series forecasting: it groups similar variate tokens by their dominant frequencies and runs sparse attention over a sampled subset, which improves training efficiency.
Contribution
The paper introduces VarDrop, a new strategy that adaptively omits redundant variate tokens using frequency hashing and stratified sampling to enhance training efficiency.
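The stratified sampling step can be illustrated with a minimal sketch. It assumes the variates have already been grouped (for example by a frequency hash) and keeps a fraction of each group. The function name `stratified_keep`, the `keep_ratio` parameter, and the "at least one per group" rule are hypothetical choices for illustration, not details confirmed by the paper.

```python
import math
import random


def stratified_keep(groups, keep_ratio, rng):
    """Pick a random subset of variate indices from every group.

    groups: dict mapping a group key to a list of variate indices.
    keep_ratio: fraction of each group to keep (an assumed rule, not from the paper).
    rng: a random.Random instance, passed in so the draw is reproducible.
    """
    kept = []
    for members in groups.values():
        # Keep at least one token per group so no group disappears entirely.
        n_keep = max(1, math.ceil(keep_ratio * len(members)))
        kept.extend(rng.sample(members, n_keep))
    return sorted(kept)


# Seven variates in three hypothetical groups.
example_groups = {"g0": [0, 1, 2, 3], "g1": [4], "g2": [5, 6]}
kept = stratified_keep(example_groups, keep_ratio=0.5, rng=random.Random(0))

# Dot-product attention scores scale with the square of the token count.
pairs_before = 7 ** 2
pairs_after = len(kept) ** 2
```

With these inputs four of the seven tokens survive, so the number of pairwise attention scores falls from 49 to 16 while every group keeps a representative.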
Findings
VarDrop outperforms existing efficient baselines on benchmark datasets.
It significantly reduces the computational cost of attention in multivariate forecasting.
The method maintains forecasting accuracy while improving efficiency.
Abstract
Variate tokenization, which independently embeds each variate as a separate token, has achieved remarkable improvements in multivariate time series forecasting. However, employing self-attention with variate tokens incurs a quadratic computational cost with respect to the number of variates, thus limiting its training efficiency for large-scale applications. To address this issue, we propose VarDrop, a simple yet efficient strategy that reduces the token usage by omitting redundant variate tokens during training. VarDrop adaptively excludes redundant tokens within a given batch, thereby reducing the number of tokens used for dot-product attention while preserving essential information. Specifically, we introduce k-dominant frequency hashing (k-DFH), which utilizes the ranked dominant frequencies in the frequency domain as a hash value to efficiently group variate tokens exhibiting…
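The idea behind k-DFH can be sketched as follows: transform each variate to the frequency domain, take the k frequency bins with the largest amplitude in ranked order, and use that tuple as a grouping key. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names are hypothetical, the zero-frequency (mean) bin is skipped by assumption, and a naive DFT stands in for an FFT to keep the example dependency-free.

```python
import cmath
import math
from collections import defaultdict


def dominant_frequency_key(series, k):
    """Return the k largest-amplitude frequency bins, ranked by amplitude."""
    n = len(series)
    amplitudes = []
    # Naive DFT over bins 1..n//2; bin 0 (the mean) is skipped by assumption.
    for f in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * f * t / n) for t, x in enumerate(series)
        )
        amplitudes.append((abs(coeff), f))
    # Largest amplitude first; ties go to the lower frequency bin.
    amplitudes.sort(key=lambda pair: (-pair[0], pair[1]))
    return tuple(f for _, f in amplitudes[:k])


def group_variates(batch, k):
    """Group variate indices whose ranked dominant frequencies match."""
    groups = defaultdict(list)
    for idx, series in enumerate(batch):
        groups[dominant_frequency_key(series, k)].append(idx)
    return dict(groups)


def wave(n, parts):
    """Sum of sinusoids; parts is a list of (amplitude, frequency_bin, phase)."""
    return [
        sum(a * math.sin(2 * math.pi * f * t / n + p) for a, f, p in parts)
        for t in range(n)
    ]


n = 32
batch = [
    wave(n, [(1.0, 2, 0.0), (0.5, 5, 0.0)]),  # bin 2 dominant, then bin 5
    wave(n, [(2.0, 2, 0.0), (1.0, 5, 0.0)]),  # same shape, scaled by 2
    wave(n, [(0.5, 2, 0.0), (1.0, 5, 0.0)]),  # bin 5 dominant, then bin 2
    wave(n, [(1.0, 2, 1.0), (0.5, 5, 0.3)]),  # same as the first, phase-shifted
]
freq_groups = group_variates(batch, k=2)
```

In this toy batch, variates 0, 1, and 3 share the key `(2, 5)` because scaling and phase shifts leave the amplitude ranking unchanged, while variate 2 gets `(5, 2)`. Because the key is ordered by rank, two variates with the same frequencies but a different dominant one land in different groups.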