Unified Normalization for Accelerating and Stabilizing Transformers

Qiming Yang; Kai Zhang; Chaoxiang Lan; Zhi Yang; Zheyang Li; Wenming Tan; Jun Xiao; Shiliang Pu

arXiv:2208.01313·cs.CV·August 3, 2022

TL;DR

This paper introduces Unified Normalization (UN), a hardware-efficient normalization method for Transformers that can be fused with adjacent linear operations at inference, stabilizes training, and performs comparably to Layer Normalization, with about 31% faster GPU inference and nearly 18% lower memory usage.

Contribution

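The paper proposes Unified Normalization, which targets two problems: the hardware cost of Layer Normalization's on-the-fly statistics, division, and square root at inference, and the degraded or collapsed training seen when Layer Normalization is replaced with hardware-efficient alternatives such as Batch Normalization.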

Findings

01. UN achieves about 31% inference speedup on GPU.

02. UN reduces memory usage by nearly 18%.

03. UN maintains comparable performance to Layer Normalization.

Abstract

Solid results from Transformers have made them prevailing architectures in various natural language and vision tasks. As a default component in Transformers, Layer Normalization (LN) normalizes activations within each token to boost the robustness. However, LN requires on-the-fly statistics calculation in inference as well as division and square root operations, leading to inefficiency on hardware. What is more, replacing LN with other hardware-efficient normalization schemes (e.g., Batch Normalization) results in inferior performance, even collapse in training. We find that this dilemma is caused by abnormal behaviors of activation statistics, including large fluctuations over iterations and extreme outliers across layers. To tackle these issues, we propose Unified Normalization (UN), which can speed up the inference by being fused with other linear operations and achieve comparable…
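The contrast the abstract draws can be illustrated with a minimal sketch. It is not the authors' UN implementation: it only shows why Layer Normalization needs per-token statistics, a division, and a square root at inference, while a normalization whose per-channel statistics are frozen after training reduces to an affine map that folds into the following linear layer. All function names and numeric values below are hypothetical and chosen for illustration.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm for one token: statistics come from x itself, at run time."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)  # division and square root per token
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

def fixed_stat_norm(x, mu, var, gamma, beta, eps=1e-5):
    """Normalization with per-channel statistics frozen after training (assumed)."""
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, m, s, g, b in zip(x, mu, var, gamma, beta)]

def linear(x, W, bias):
    """Plain dense layer: one output per row of W."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W, bias)]

def fuse_norm_into_linear(W, bias, mu, var, gamma, beta, eps=1e-5):
    """Fold the fixed-statistics normalization into the next linear layer."""
    # The normalization is y_j = scale_j * x_j + shift_j, computed once offline.
    scale = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    shift = [b - m * sc for b, m, sc in zip(beta, mu, scale)]
    # W' scales each input column; b' absorbs the shift passed through W.
    W_fused = [[w * sc for w, sc in zip(row, scale)] for row in W]
    b_fused = [sum(w * sh for w, sh in zip(row, shift)) + b
               for row, b in zip(W, bias)]
    return W_fused, b_fused

# Hypothetical values for a 3-channel token and a 3-to-2 linear layer.
x = [0.5, -1.0, 2.0]
mu, var = [0.1, -0.2, 0.3], [1.5, 0.8, 2.0]
gamma, beta = [1.0, 0.5, 2.0], [0.0, 0.1, -0.1]
W, bias = [[1.0, 2.0, -1.0], [0.5, 0.0, 3.0]], [0.2, -0.3]

reference = linear(fixed_stat_norm(x, mu, var, gamma, beta), W, bias)
W_fused, b_fused = fuse_norm_into_linear(W, bias, mu, var, gamma, beta)
fused = linear(x, W_fused, b_fused)
max_abs_diff = max(abs(a - b) for a, b in zip(reference, fused))
```

After fusion, inference runs a single linear layer with no separate normalization step, and `max_abs_diff` should be at floating-point rounding level. The same folding is not available for `layer_norm`, because its mean and variance depend on each input token.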

Code & Models

Repositories

hikvision-research/unified-normalization (PyTorch, official)

Taxonomy

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Layer Normalization