UnitNorm: Rethinking Normalization for Transformers in Time Series

Nan Huang; Christian Kümmerle; Xiang Zhang

arXiv:2405.15903·cs.LG·May 28, 2024·1 cites

Open Access 4 Reviews

TL;DR

UnitNorm introduces a novel normalization method for Transformer models in time series analysis, improving performance and stability by addressing issues caused by traditional normalization techniques.

Contribution

It proposes UnitNorm, a new normalization approach that scales vectors by their norms, effectively enhancing attention mechanisms in time series Transformers.
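
As a minimal sketch of what "scaling vectors by their norms" can look like, the Python snippet below divides a token vector by its Euclidean norm. The function name `unit_norm`, the `scale` argument, and the `eps` guard are illustrative assumptions, not the authors' implementation.

```python
import math

def unit_norm(x, scale=1.0, eps=1e-12):
    """Rescale x to (approximately) length `scale` without centering it.

    `scale` and `eps` are hypothetical parameters added for illustration.
    """
    norm = math.sqrt(sum(v * v for v in x))
    return [scale * v / (norm + eps) for v in x]

token = [3.0, -4.0]          # Euclidean norm is 5
out = unit_norm(token)       # direction and signs are kept, length becomes ~1
out_norm = math.sqrt(sum(v * v for v in out))
```

Because nothing is subtracted from the vector, every component keeps its sign and the direction of the token is unchanged; only its length is altered.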

Findings

01. Significant MSE reduction in forecasting tasks.

02. Improved accuracy in classification tasks.

03. Demonstrated robustness across multiple datasets and models.

Abstract

Normalization techniques are crucial for enhancing Transformer models' performance and stability in time series analysis tasks, yet traditional methods like batch and layer normalization often lead to issues such as token shift, attention shift, and sparse attention. We propose UnitNorm, a novel approach that scales input vectors by their norms and modulates attention patterns, effectively circumventing these challenges. Grounded in existing normalization frameworks, UnitNorm's effectiveness is demonstrated across diverse time series analysis tasks, including forecasting, classification, and anomaly detection, via a rigorous evaluation on 6 state-of-the-art models and 10 datasets. Notably, UnitNorm shows superior performance, especially in scenarios requiring robust attention mechanisms and contextual comprehension, evidenced by significant improvements by up to a 1.46 decrease in MSE…
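
To make the token-shift concern concrete, here is a toy example (not taken from the paper) in which centering flips the sign of the dot product between two tokens, while norm-only scaling preserves it. The `layer_norm` below omits the learned affine parameters, and all names and values are assumptions chosen for illustration.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def layer_norm(x, eps=1e-5):
    # Center and divide by the standard deviation (no learned gain or bias).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def unit_norm(x, eps=1e-12):
    # Scale only: divide by the Euclidean norm, no centering.
    norm = math.sqrt(dot(x, x))
    return [v / (norm + eps) for v in x]

query, key = [1.0, 2.0], [2.0, 1.0]

raw_score = dot(query, key)                          # positive before normalization
ln_score = dot(layer_norm(query), layer_norm(key))   # centering makes this negative
un_score = dot(unit_norm(query), unit_norm(key))     # sign is preserved
```

In this example the two tokens point in similar directions, yet after centering they become exact opposites, so a softmax over such scores would rank this key very differently.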

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- Normalization in time-series Transformers is an under-examined but increasingly important research direction.
- The discussion on vector orientation, sign-preservation, and attention fidelity provides useful intuition to the field.
- Evaluation spans multiple time-series tasks: forecasting, classification, and anomaly detection.

Weaknesses

- The PatchTST, FEDFormer, and CrossFormer results in this paper are significantly weaker than the officially reported numbers, raising validity concerns for all reported gains.
- PatchTST relies critically on RevIN (Kim et al., ICLR 2022), yet it is unclear whether RevIN is applied fully or consistently. This is a major concern because the presence or absence of RevIN fundamentally alters the statistical properties of time-series input (scale distribution, domain-shift behavior, and vector directions), there

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The approach is well motivated and clearly identifies normalization–attention interactions as a key issue, framing token/attention shift as a quantifiable phenomenon.
2. Solid theoretical analysis: the paper includes formal theorems (sign-flip probability, gradient invariance, entropy bound) with proofs in the appendix, giving strong mathematical grounding.
3. Implementation simplicity: UnitNorm can directly replace existing normalization layers, making it easy to adopt.
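
The link between token scale and attention entropy mentioned in point 2 can be illustrated with a small numerical sketch. It is not a reproduction of the paper's theorems: the query, keys, and scale factors are made-up values. It only shows that multiplying all tokens by a larger factor sharpens the softmax and lowers the entropy of the attention weights.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def attention_entropy(query, keys):
    # Scaled dot-product scores for one query against all keys.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    return entropy(softmax(scores))

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

entropies = []
for s in (0.5, 1.0, 2.0, 4.0):            # hypothetical token scale factors
    scaled_query = [s * v for v in query]
    scaled_keys = [[s * v for v in key] for key in keys]
    entropies.append(attention_entropy(scaled_query, scaled_keys))
```

The values in `entropies` decrease as the scale grows, starting just below the uniform maximum of `log(3)`. Fixing the token norm therefore also constrains how sparse the attention distribution can become.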

Weaknesses

1. Insufficient empirical validation on large-scale datasets and models. The paper mainly uses small to medium-sized benchmarks (ETTh1/2, ECL, Exchange, Solar). However, recent works such as Chronos2, Moirai, and Sundial have been trained on large-scale datasets (e.g., LOTSA) and evaluated on large-scale, multi-domain benchmarks (GIFT-EVAL, FEV) that test model robustness and scalability. Without evaluation on these, the claim that UnitNorm “generalizes across time-series domains” is not sufficientl

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. A minimal, easily implementable normalization layer that is a drop-in replacement for many backbones. (The code is provided with reproducibility information.)
2. Links token scale to attention entropy/sparsity with interpretable bounds. This helps reason about when centering may hurt.
3. Linking normalization to attention entropy via Theorems 3.2-3.3 is insightful and not addressed in prior work. The entropy lower bound provides actionable guidance.
4. Covers multiple architectures and tasks, including forecasting,

Weaknesses

1. Missing time-series baselines for non-stationarity. There is no comparison with RevIN (input-level reversible instance normalization) or related methods (e.g., DAIN[4]) that explicitly target distribution/regime shifts, which are central to the paper’s motivation.
2. Incremental novelty vs. RMSNorm[3]. The method is algorithmically close to RMSNorm (scale-only) with a fixed modulus; the core novelty is the explicit norm target and entropy framing. More distinctive empirical behavior is needed to clear a top-tier bar.
3. k-pa
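
The closeness to RMSNorm raised in point 2 can be checked numerically. The sketch below assumes a gain-free RMSNorm (no learned scale) and a plain unit-norm scaling; under those assumptions the two outputs differ only by the constant factor `sqrt(d)`, where `d` is the vector dimension. The function names and the input vector are illustrative.

```python
import math

def rms_norm(x):
    # Gain-free RMSNorm: divide by the root mean square of the entries.
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]

def unit_norm(x):
    # Scale-only normalization to unit Euclidean length.
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

x = [1.0, -2.0, 2.0, 4.0]     # nonzero example vector, Euclidean norm 5
d = len(x)

rms_out = rms_norm(x)
unit_out = unit_norm(x)

# Element-wise ratio between the two outputs; constant and equal to sqrt(d).
ratios = [r / u for r, u in zip(rms_out, unit_out)]
```

Any practical difference between the two layers therefore comes from how the output modulus is chosen and whether a learned gain is applied, not from the direction of the normalized vector.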

Reviewer 04 · Rating 4 · Confidence 3

Strengths

- The paper provides thorough and well-motivated theoretical analysis regarding the limitations of conventional normalization methods, identifying three potential issues (token shift, attention shift, and sparse attention) that could degrade Transformer performance on time series tasks.
- The effectiveness of the proposed UnitNorm is validated across diverse time series applications, including forecasting, classification, and anomaly detection. The experimental results demonstrate that the UnitNor

Weaknesses

- Some claims lack sufficient justification:
  - In *line 123*, the statement *“altering token vector orientations and obscuring long-range dependencies”* is vague. It would be helpful to provide clearer theoretical or empirical evidence explaining why standard normalization methods cause this issue.
  - In *line 267*, the paper claims that sparse attention is problematic in TSA tasks, but this claim is not sufficiently explained or analyzed. Additional clarification or supporting experiments wo

Taxonomy

Topics: Time Series Analysis and Forecasting

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Adam · Dense Connections