Linear Transformers as VAR Models: Aligning Autoregressive Attention Mechanisms with Autoregressive Forecasting

Jiecheng Lu; Shihao Yang

arXiv:2502.07244 · cs.LG · February 6, 2026

TL;DR

This paper shows that linear attention in Transformers can be interpreted as a dynamic vector autoregressive (VAR) model for time series forecasting (TSF), and introduces SAMoVAR, a variant that improves interpretability and performance by aligning the architecture with the autoregressive forecasting objective.

Contribution

The paper reveals the VAR structure within linear attention, identifies structural mismatches between multi-layer Transformers and the autoregressive forecasting objective, and proposes SAMoVAR, a new model that aligns the Transformer architecture with that objective.

Findings

01 A single linear attention layer can be interpreted as a dynamic VAR structure.

02 Multi-layer linear attention can also be aligned as a VAR model by rearranging the MLP, attention, and input-output flow.

03 SAMoVAR outperforms state-of-the-art TSF models in accuracy and interpretability.

Abstract

Autoregressive attention-based time series forecasting (TSF) has drawn increasing interest, with mechanisms like linear attention sometimes outperforming vanilla attention. However, deeper Transformer architectures frequently misalign with autoregressive objectives, obscuring the underlying VAR structure embedded within linear attention and hindering their ability to capture the data generative processes in TSF. In this work, we first show that a single linear attention layer can be interpreted as a dynamic vector autoregressive (VAR) structure. We then explain that existing multi-layer Transformers have structural mismatches with the autoregressive forecasting objective, which impair interpretability and generalization ability. To address this, we show that by rearranging the MLP, attention, and input-output flow, multi-layer linear attention can also be aligned as a VAR model. Then,…
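The claim that a single linear attention layer has a dynamic VAR structure can be checked numerically. The sketch below is not taken from the paper or its repository: it assumes unnormalised causal linear attention, o_t = sum over i <= t of v_i (k_i · q_t), and uses the algebraic identity v_i (k_i · q_t) = (v_i q_t^T) k_i to rewrite the output as a sum of time-varying weight matrices A[t, i] = v_i q_t^T applied to the lagged vectors k_i. The function names, the random inputs, and the choice of NumPy are illustrative assumptions.

```python
import numpy as np

def causal_linear_attention(Q, K, V):
    """Unnormalised causal linear attention: o_t = sum_{i<=t} v_i (k_i . q_t)."""
    scores = Q @ K.T                        # scores[t, i] = q_t . k_i
    mask = np.tril(np.ones_like(scores))    # keep only positions i <= t
    return (scores * mask) @ V

def dynamic_var_form(Q, K, V):
    """Same output written as o_t = sum_{i<=t} A[t, i] @ k_i, with A[t, i] = v_i q_t^T."""
    T = Q.shape[0]
    out = np.zeros_like(V)
    for t in range(T):
        for i in range(t + 1):
            A_ti = np.outer(V[i], Q[t])     # weight matrix that depends on both t and i
            out[t] += A_ti @ K[i]           # VAR-style term: matrix times lagged vector
    return out

# Hypothetical toy inputs: T time steps, d-dimensional queries, keys, and values.
rng = np.random.default_rng(0)
T, d = 6, 4
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

attn_out = causal_linear_attention(Q, K, V)
var_out = dynamic_var_form(Q, K, V)
max_abs_diff = float(np.max(np.abs(attn_out - var_out)))
print("max abs difference:", max_abs_diff)
```

The two functions agree up to floating-point error, which is the sense in which the attention output is a VAR-like sum whose coefficient matrices change with the query position. How the paper normalises the attention and which vectors it treats as the lagged observations may differ from this toy setup.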

Code & Models

Repositories

ljc-fvnr/structural-aligned-mixture-of-var
PyTorch · Official

Taxonomy

Topics: Forecasting Techniques and Applications · Energy Load and Power Forecasting · Stock Market Forecasting Methods

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax · Dropout · Absolute Position Encodings · Label Smoothing · Byte Pair Encoding