Norm of Mean Contextualized Embeddings Determines their Variance

Hiroaki Yamagiwa; Hidetoshi Shimodaira

arXiv:2409.11253 · cs.CL · December 18, 2024

TL;DR

This paper investigates how the norm of the mean contextualized embedding relates to the variance of the embeddings. It reports a strong trade-off between the two, likely influenced by layer normalization in Transformer models, and decomposes the total variance into within-cluster and between-cluster components.

Contribution

It introduces an analysis of the norm-variance relationship in Transformer embeddings, linking it to layer normalization and providing a theoretical decomposition of total variance.
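
One way to read the decomposition mentioned above is as the standard law of total variance applied to token clusters. The sketch below is a minimal illustration on synthetic data, assuming NumPy. The function names, the weighting of clusters by relative frequency, and the toy data are assumptions made for illustration, not the authors' code.

```python
import numpy as np


def total_variance(x):
    """Mean squared distance of the rows of x from their mean."""
    return float(np.mean(np.sum((x - x.mean(axis=0)) ** 2, axis=1)))


def decompose_variance(x, labels):
    """Split total variance into within- and between-cluster parts.

    Each cluster is weighted by its share of the rows (an assumption
    that matches the usual law of total variance).
    """
    n = len(x)
    grand_mean = x.mean(axis=0)
    within = 0.0
    between = 0.0
    for t in np.unique(labels):
        cluster = x[labels == t]
        weight = len(cluster) / n
        within += weight * total_variance(cluster)
        between += weight * float(np.sum((cluster.mean(axis=0) - grand_mean) ** 2))
    return within, between


# Synthetic stand-in for token clusters: 5 hypothetical tokens, 16 dimensions.
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=1000)
centers = rng.normal(size=(5, 16))
x = centers[labels] + rng.normal(scale=0.5, size=(1000, 16))

within, between = decompose_variance(x, labels)
print(within + between, total_variance(x))  # the two values should agree
```

In this reading, the within-cluster term measures how much each token's embeddings spread around that token's own mean, and the between-cluster term measures how far the token means sit from the overall mean.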

Findings

01. A strong trade-off holds between the norm of the mean embedding and the variance: as the mean embedding moves closer to the origin, the variance increases.

02. Layer depth affects how the variance is distributed, with deeper layers showing increased within-cluster variance.

03. The variance decomposition aligns with the anisotropy observed in embedding spaces.
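
To see why layer normalization could produce such a trade-off, consider an idealized case. If every embedding is normalized without a learned scale or shift, each vector has squared norm exactly equal to the dimension d, so the squared norm of the mean and the variance must sum to d. The sketch below checks this on synthetic data, assuming NumPy. The simplified `layer_norm` (no affine parameters, eps set to 0) and the toy inputs are assumptions; real Transformer layers include learned parameters and residual connections, so the relation would hold only approximately there.

```python
import numpy as np


def layer_norm(x, eps=0.0):
    """Idealized layer normalization: no learned scale or shift."""
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
d = 64
direction = rng.normal(size=d)

results = []
for shift in (0.0, 1.0, 4.0):
    # A larger shift pushes all samples toward a common direction.
    raw = rng.normal(size=(500, d)) + shift * direction
    z = layer_norm(raw)
    m = float(np.sum(z.mean(axis=0) ** 2))  # squared norm of the mean
    v = float(np.mean(np.sum((z - z.mean(axis=0)) ** 2, axis=1)))  # variance
    results.append((m, v))
    print(f"shift={shift}: M={m:.3f} V={v:.3f} M+V={m + v:.3f}")
```

Because the sum is pinned to d in this idealized setting, any increase in the norm of the mean is matched by an equal decrease in variance.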

Abstract

Contextualized embeddings vary by context, even for the same token, and form a distribution in the embedding space. To analyze this distribution, we focus on the norm of the mean embedding and the variance of the embeddings. In this study, we first demonstrate that these values follow the well-known formula for variance in statistics and provide an efficient sequential computation method. Then, by observing embeddings from intermediate layers of several Transformer models, we found a strong trade-off relationship between the norm and the variance: as the mean embedding becomes closer to the origin, the variance increases. This trade-off is likely influenced by the layer normalization mechanism used in Transformer models. Furthermore, when the sets of token embeddings are treated as clusters, we show that the variance of the entire embedding set can theoretically be decomposed into the…
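
The "well-known formula for variance" is presumably the identity that the variance equals the mean squared norm minus the squared norm of the mean, written here as V = Q - M (these symbols are labels chosen for this sketch). The abstract does not spell out the sequential computation method, so the following is only a generic one-pass approach using running sums, assuming NumPy. The class name and its methods are hypothetical.

```python
import numpy as np


class RunningNormVariance:
    """Accumulate the sum of embeddings and the sum of squared norms."""

    def __init__(self, dim):
        self.n = 0
        self.sum_vec = np.zeros(dim)
        self.sum_sq_norm = 0.0

    def update(self, batch):
        """Add a batch of embeddings with shape (batch_size, dim)."""
        batch = np.atleast_2d(batch)
        self.n += batch.shape[0]
        self.sum_vec += batch.sum(axis=0)
        self.sum_sq_norm += float(np.sum(batch ** 2))

    def mean_sq_norm(self):
        """Q: average squared norm of the embeddings seen so far."""
        return self.sum_sq_norm / self.n

    def sq_norm_of_mean(self):
        """M: squared norm of the mean embedding."""
        mean = self.sum_vec / self.n
        return float(np.sum(mean ** 2))

    def variance(self):
        """V = Q - M."""
        return self.mean_sq_norm() - self.sq_norm_of_mean()


# Stream synthetic embeddings in batches and compare with a direct computation.
rng = np.random.default_rng(0)
embeddings = rng.normal(loc=0.3, size=(2000, 32))

stats = RunningNormVariance(dim=32)
for start in range(0, len(embeddings), 256):
    stats.update(embeddings[start:start + 256])

direct_variance = float(
    np.mean(np.sum((embeddings - embeddings.mean(axis=0)) ** 2, axis=1))
)
print(stats.variance(), direct_variance)  # the two values should agree
```

Running sums keep memory constant per token, which matters when a corpus yields many embeddings. When Q and M are nearly equal, subtracting them can lose precision, and a Welford-style update may be a safer choice.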

Code & Models

Repositories

ymgw55/Norm-and-Variance (official)

Taxonomy

Topics: Morphological variations and asymmetry · Functional Brain Connectivity Studies · Bayesian Methods and Mixture Models

Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Multi-Head Attention · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer