Zipfian Whitening

Sho Yokoi; Han Bao; Hiroto Kurita; Hidetoshi Shimodaira

arXiv:2411.00680 · cs.CL · November 4, 2024

TL;DR

This paper shows that PCA whitening of word embeddings, weighted by empirical word frequency (which follows Zipf's law rather than a uniform distribution), significantly improves task performance and surpasses established baselines.

Contribution

It introduces a simple yet effective method, Zipfian-weighted whitening of word embeddings, together with a theoretical framework in which word representations are distributed according to an exponential family with either a uniform or a Zipfian base measure.
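As a rough illustration of the uniform versus Zipfian distinction, a generic exponential-family word model can be written as below. The notation is an assumption made here for illustration and is not taken from the paper: $\mathbf{w}$ is a word vector, $\mathbf{c}$ a context vector, $\pi$ the base measure over the vocabulary $V$, and $p(w)$ the empirical word frequency.

```latex
p(w \mid c) = \frac{\pi(w)\,\exp\langle \mathbf{w}, \mathbf{c} \rangle}{Z(c)},
\qquad
Z(c) = \sum_{w' \in V} \pi(w')\,\exp\langle \mathbf{w}', \mathbf{c} \rangle,
\qquad
\pi(w) =
\begin{cases}
  1/|V| & \text{(uniform base measure)} \\
  p(w)  & \text{(Zipfian base measure)}
\end{cases}
```

Under the uniform choice every word type counts equally. Under the Zipfian choice the frequency term is carried by the base measure rather than by the vectors themselves.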

Findings

01 Zipfian-weighted whitening improves task performance, surpassing established baselines.

02 Theory links the word frequency distribution to properties of the embedding space.

03 Existing NLP methods encode empirical word frequency in embeddings.

Abstract

The word embedding space in neural models is skewed, and correcting this can improve task performance. We point out that most approaches for modeling, correcting, and measuring the symmetry of an embedding space implicitly assume that the word frequencies are uniform; in reality, word frequencies follow a highly non-uniform distribution, known as Zipf's law. Surprisingly, simply performing PCA whitening weighted by the empirical word frequency that follows Zipf's law significantly improves task performance, surpassing established baselines. From a theoretical perspective, both our approach and existing methods can be clearly categorized: word representations are distributed according to an exponential family with either uniform or Zipfian base measures. By adopting the latter approach, we can naturally emphasize informative low-frequency words in terms of their vector norm, which…
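The frequency-weighted PCA whitening described in the abstract can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation (see the official repository for that): the function name `zipfian_whitening`, its arguments, and the `eps` floor on eigenvalues are all hypothetical. The only difference from ordinary PCA whitening is that the mean and covariance are expectations under the word frequency distribution instead of uniform averages over the vocabulary.

```python
import numpy as np

def zipfian_whitening(emb, freq, eps=1e-12):
    """Whiten word vectors using frequency-weighted moments.

    emb  : (V, d) array, one row per word type (hypothetical input layout).
    freq : (V,) array of word counts or probabilities.
    """
    p = np.asarray(freq, dtype=np.float64)
    p = p / p.sum()                      # normalize to a probability vector

    # Weighted mean: expectation of the word vector under p.
    mu = p @ emb
    centered = emb - mu

    # Weighted covariance: sum_w p(w) * (w - mu)(w - mu)^T.
    cov = (centered * p[:, None]).T @ centered

    # PCA whitening: rotate onto eigenvectors, rescale by 1/sqrt(eigenvalue).
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, eps, None)  # guard against tiny or negative values
    transform = eigvecs / np.sqrt(eigvals)

    return centered @ transform
```

Passing a constant `freq` vector recovers ordinary, uniformly weighted PCA whitening, which makes the two variants easy to compare on the same embeddings.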

Code & Models

Repositories

cl-tohoku/zipfian-whitening
PyTorch · Official

Videos

Zipfian Whitening · SlidesLive

Taxonomy

Topics: Garlic and Onion Studies · Melanin and Skin Pigmentation · Dye Analysis and Toxicity

Methods: Balanced Selection · Principal Components Analysis · PCA Whitening