Norm of Mean Contextualized Embeddings Determines their Variance
Hiroaki Yamagiwa, Hidetoshi Shimodaira

TL;DR
This paper investigates how the norm of mean contextualized embeddings relates to their variance, revealing a trade-off influenced by layer normalization in Transformer models and decomposing variance into within- and between-cluster components.
Contribution
It introduces an analysis of the norm-variance relationship in Transformer embeddings, linking it to layer normalization and providing a theoretical decomposition of total variance.
Findings
Strong trade-off between the norm of the mean embedding and the variance: as the mean embedding moves closer to the origin, the variance increases.
Layer depth affects variance distribution, with deeper layers showing increased within-cluster variance.
Variance decomposition aligns with the anisotropy observed in embedding spaces.
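The within- and between-cluster decomposition mentioned above can be illustrated with the standard law of total variance. The sketch below is not the authors' code. It assumes that "variance" means the mean squared Euclidean distance to the mean embedding and that each token's set of embeddings is one cluster. The function name `decompose_variance` and the toy data are hypothetical.

```python
import numpy as np

def decompose_variance(X, labels):
    """Split total variance into within- and between-cluster parts.

    Assumption: variance = mean squared Euclidean distance to the mean.
    X has shape (n, d); labels has shape (n,) with one cluster id per row.
    """
    X = np.asarray(X, dtype=np.float64)
    labels = np.asarray(labels)
    n = len(X)
    mu = X.mean(axis=0)                                  # global mean embedding
    total = np.mean(np.sum((X - mu) ** 2, axis=1))       # total variance
    within, between = 0.0, 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        weight = len(Xk) / n                             # cluster share n_k / n
        mu_k = Xk.mean(axis=0)                           # cluster mean
        within += weight * np.mean(np.sum((Xk - mu_k) ** 2, axis=1))
        between += weight * np.sum((mu_k - mu) ** 2)
    return total, within, between

# Toy data: three "tokens", each with its own cluster of random vectors.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(loc=c, size=(200, 16)) for c in (-1.0, 0.0, 2.0)])
labels = np.repeat([0, 1, 2], 200)
total, within, between = decompose_variance(X, labels)
print(total, within + between)  # the two values agree up to rounding
```

Weighting each cluster by its share of the data is what makes the two parts sum exactly to the total; with unweighted averages the identity holds only when all clusters have the same size.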
Abstract
Contextualized embeddings vary by context, even for the same token, and form a distribution in the embedding space. To analyze this distribution, we focus on the norm of the mean embedding and the variance of the embeddings. In this study, we first demonstrate that these values follow the well-known formula for variance in statistics and provide an efficient sequential computation method. Then, by observing embeddings from intermediate layers of several Transformer models, we found a strong trade-off relationship between the norm and the variance: as the mean embedding becomes closer to the origin, the variance increases. This trade-off is likely influenced by the layer normalization mechanism used in Transformer models. Furthermore, when the sets of token embeddings are treated as clusters, we show that the variance of the entire embedding set can theoretically be decomposed into the…
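The "well-known formula for variance" referred to in the abstract is presumably the identity that the variance equals the mean squared norm minus the squared norm of the mean. The sketch below shows one way to compute both quantities in a single streaming pass under that reading. It is an illustration, not the paper's method. The class name `RunningStats` and the random test data are hypothetical.

```python
import numpy as np

class RunningStats:
    """Accumulate the mean-embedding norm and variance one vector at a time.

    Assumption: variance = (1/n) * sum ||x_i||^2 - ||mean||^2,
    i.e. the mean squared Euclidean distance to the mean embedding.
    """

    def __init__(self, dim):
        self.n = 0
        self.sum_vec = np.zeros(dim, dtype=np.float64)   # running sum of vectors
        self.sum_sq_norm = 0.0                           # running sum of ||x||^2

    def update(self, x):
        x = np.asarray(x, dtype=np.float64)
        self.n += 1
        self.sum_vec += x
        self.sum_sq_norm += float(x @ x)

    def mean_norm_sq(self):
        mean = self.sum_vec / self.n
        return float(mean @ mean)

    def variance(self):
        return self.sum_sq_norm / self.n - self.mean_norm_sq()

# Compare the streaming result with a direct two-pass computation.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8)) + 0.5
stats = RunningStats(dim=8)
for x in X:
    stats.update(x)
direct = np.mean(np.sum((X - X.mean(axis=0)) ** 2, axis=1))
print(stats.variance(), direct)
```

Only a vector sum and a scalar sum are stored, so embeddings never need to be held in memory together. The identity also suggests why a trade-off could arise: if every embedding had the same squared norm c (roughly what layer normalization without a learned scale and shift produces), the variance would be exactly c minus the squared norm of the mean. That equal-norm case is an illustrative assumption, not a claim from the paper.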
Taxonomy
Topics: Morphological variations and asymmetry · Functional Brain Connectivity Studies · Bayesian Methods and Mixture Models
Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Multi-Head Attention · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer
