Zipfian Whitening
Sho Yokoi, Han Bao, Hiroto Kurita, Hidetoshi Shimodaira

TL;DR
This paper demonstrates that applying Zipfian-weighted PCA whitening to word embeddings, which accounts for the non-uniform distribution of word frequencies, significantly enhances NLP task performance.
Contribution
It introduces a simple yet effective method, Zipfian-weighted whitening of word embeddings, and provides a theoretical framework that links both this method and existing NLP methods to exponential family distributions.
Findings
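PCA whitening weighted by empirical word frequency improves task performance, surpassing established baselines.
Word representations can be modeled as following an exponential family with either a uniform or a Zipfian base measure, which links the word frequency distribution to embedding properties.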
Existing NLP methods encode empirical word frequency in embeddings.
Abstract
The word embedding space in neural models is skewed, and correcting this can improve task performance. We point out that most approaches for modeling, correcting, and measuring the symmetry of an embedding space implicitly assume that the word frequencies are uniform; in reality, word frequencies follow a highly non-uniform distribution, known as Zipf's law. Surprisingly, simply performing PCA whitening weighted by the empirical word frequency that follows Zipf's law significantly improves task performance, surpassing established baselines. From a theoretical perspective, both our approach and existing methods can be clearly categorized: word representations are distributed according to an exponential family with either uniform or Zipfian base measures. By adopting the latter approach, we can naturally emphasize informative low-frequency words in terms of their vector norm, which…
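A minimal sketch of frequency-weighted PCA whitening, as described in the abstract, might look as follows. It is an illustration under stated assumptions and not the authors' implementation: the function name `zipfian_whiten`, the arguments `emb` and `freq`, and the synthetic data are all hypothetical. The only difference from uniform whitening in this sketch is that the mean and covariance are expectations under the word probability `p` instead of simple averages over the vocabulary.

```python
import numpy as np

def zipfian_whiten(emb, freq, eps=1e-12):
    """Whiten embeddings using frequency-weighted mean and covariance.

    emb  : (n_words, dim) array of word vectors (hypothetical input)
    freq : (n_words,) array of non-negative word counts or frequencies
    """
    p = freq / freq.sum()                        # empirical word probabilities
    mu = p @ emb                                 # frequency-weighted mean vector
    centered = emb - mu                          # centering under p
    cov = (centered * p[:, None]).T @ centered   # frequency-weighted covariance
    eigvals, eigvecs = np.linalg.eigh(cov)       # symmetric eigendecomposition
    scale = 1.0 / np.sqrt(np.maximum(eigvals, eps))
    return (centered @ eigvecs) * scale          # rotate, then rescale each axis

# Synthetic example: correlated random vectors with Zipf-like frequencies.
rng = np.random.default_rng(0)
n_words, dim = 1000, 8
emb = rng.normal(size=(n_words, dim)) @ rng.normal(size=(dim, dim))
freq = 1.0 / np.arange(1, n_words + 1)           # frequency proportional to 1/rank
white = zipfian_whiten(emb, freq)

# Under p, the whitened vectors have zero mean and identity covariance.
p = freq / freq.sum()
weighted_mean = p @ white
weighted_cov = (white * p[:, None]).T @ white
```

Passing a constant `freq` to the same function recovers ordinary (uniform) PCA whitening, which makes the two settings easy to compare.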
Taxonomy
Topics: Garlic and Onion Studies · melanin and skin pigmentation · Dye analysis and toxicity
Methods: Balanced Selection · Principal Components Analysis · PCA Whitening
