Linear Transformers as VAR Models: Aligning Autoregressive Attention Mechanisms with Autoregressive Forecasting
Jiecheng Lu, Shihao Yang

TL;DR
This paper demonstrates that linear attention in Transformers can be interpreted as a vector autoregressive (VAR) model for time series forecasting, and introduces SAMoVAR, a variant that improves interpretability and performance by aligning the architecture with the autoregressive forecasting objective.
Contribution
It reveals the VAR structure within linear attention, identifies structural mismatches in multi-layer Transformers, and proposes SAMoVAR, a new model that aligns Transformer architecture with autoregressive forecasting.
Findings
A single linear attention layer can be interpreted as a dynamic VAR structure.
With the MLP, attention, and input-output flow rearranged, multi-layer linear attention can also be aligned as a VAR model.
SAMoVAR outperforms state-of-the-art TSF models in accuracy and interpretability.
Abstract
Autoregressive attention-based time series forecasting (TSF) has drawn increasing interest, with mechanisms like linear attention sometimes outperforming vanilla attention. However, deeper Transformer architectures frequently misalign with autoregressive objectives, obscuring the underlying VAR structure embedded within linear attention and hindering their ability to capture the data generative processes in TSF. In this work, we first show that a single linear attention layer can be interpreted as a dynamic vector autoregressive (VAR) structure. We then explain that existing multi-layer Transformers have structural mismatches with the autoregressive forecasting objective, which impair interpretability and generalization ability. To address this, we show that by rearranging the MLP, attention, and input-output flow, multi-layer linear attention can also be aligned as a VAR model. Then,…
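The claim that a single linear attention layer carries a VAR-like structure can be checked numerically. The sketch below is not the authors' code and not SAMoVAR; it assumes unnormalized causal linear attention, o_t = sum over i <= t of (q_t · k_i) v_i, and rewrites the same sum as o_t = sum over i <= t of A_{t,i} k_i with time-varying coefficient matrices A_{t,i} = v_i q_t^T. The function names, the toy shapes, and the choice to read the keys as the lagged observations are illustrative assumptions, and the paper's exact formulation may differ.

```python
import numpy as np


def causal_linear_attention(Q, K, V):
    """Unnormalized causal linear attention: o_t = sum_{i<=t} (q_t . k_i) v_i."""
    scores = np.tril(Q @ K.T)  # (T, T); the lower triangle keeps only i <= t
    return scores @ V          # (T, d_v)


def dynamic_var_form(Q, K, V):
    """Same output written as o_t = sum_{i<=t} A_{t,i} k_i, with A_{t,i} = v_i q_t^T.

    Hypothetical helper for illustration: each A_{t,i} acts like a VAR
    coefficient matrix that changes with the current query q_t.
    """
    T, d_v = V.shape
    out = np.zeros((T, d_v))
    for t in range(T):
        for i in range(t + 1):
            A_ti = np.outer(V[i], Q[t])  # (d_v, d_k) dynamic coefficient matrix
            out[t] += A_ti @ K[i]        # applied to the "lagged" vector k_i
    return out


# Toy data; sizes are arbitrary.
rng = np.random.default_rng(0)
T, d = 6, 4
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

attn_out = causal_linear_attention(Q, K, V)
var_out = dynamic_var_form(Q, K, V)
max_abs_diff = float(np.max(np.abs(attn_out - var_out)))
print("max abs difference:", max_abs_diff)
```

The two outputs should agree up to floating-point error, since v_i (q_t · k_i) and (v_i q_t^T) k_i are the same vector. The explicit double loop is kept only to make the per-lag coefficient matrices visible; the matrix form in `causal_linear_attention` is what one would normally compute.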
Taxonomy
Topics: Forecasting Techniques and Applications · Energy Load and Power Forecasting · Stock Market Forecasting Methods
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax · Dropout · Absolute Position Encodings · Label Smoothing · Byte Pair Encoding
