Reliable Measures of Spread in High Dimensional Latent Spaces
Anna C. Marbut, Katy McKinney-Bock, Travis J. Wheeler

TL;DR
This paper evaluates existing metrics for measuring data spread in high-dimensional latent spaces of NLP models, identifies their shortcomings, and proposes new, more reliable measures based on principal components and entropy.
Contribution
The paper introduces eight alternative measures of data spread and recommends two of them, one principal component-based and one entropy-based, for reliably comparing models of varying sizes and dimensions.
Findings
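The commonly used measures, Average Cosine Similarity and the partition function min/max ratio I(V), do not reliably compare the use of latent space across models.
Eight alternative measures are proposed and tested on seven synthetic data distributions; all but one improve over the existing metrics.
One principal component-based measure and one entropy-based measure are recommended for practical use.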
Abstract
Understanding geometric properties of natural language processing models' latent spaces allows the manipulation of these properties for improved performance on downstream tasks. One such property is the amount of data spread in a model's latent space, or how fully the available latent space is being used. In this work, we define data spread and demonstrate that the commonly used measures of data spread, Average Cosine Similarity and a partition function min/max ratio I(V), do not provide reliable metrics to compare the use of latent space across models. We propose and examine eight alternative measures of data spread, all but one of which improve over these current metrics when applied to seven synthetic data distributions. Of our proposed measures, we recommend one principal component-based measure and one entropy-based measure that provide reliable, relative measures of spread and can…
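To make the two baseline measures named in the abstract concrete, the following is a minimal sketch of how they are commonly computed. It is an illustration under stated assumptions, not the authors' code: the function names `average_cosine_similarity` and `partition_ratio` are hypothetical, and the I(V) definition used here (the min/max ratio of Z(c) = sum over v of exp(c . v), with c ranging over the eigenvectors of V^T V) follows the usual formulation in the literature, which may differ in detail from the one used in the paper.

```python
import numpy as np


def average_cosine_similarity(X):
    """Mean cosine similarity over all distinct pairs of rows in X.

    Assumes X has shape (n_points, n_dims) with no all-zero rows.
    """
    X = np.asarray(X, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = X.shape[0]
    # Exclude the diagonal (each point compared with itself).
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))


def partition_ratio(X):
    """Assumed form of I(V): min Z(c) / max Z(c) over eigenvectors c of X^T X,

    where Z(c) = sum over rows v of exp(c . v).
    """
    X = np.asarray(X, dtype=float)
    # Columns of eigvecs are the probe directions c.
    _, eigvecs = np.linalg.eigh(X.T @ X)
    Z = np.exp(X @ eigvecs).sum(axis=0)
    return Z.min() / Z.max()


# Synthetic example: an isotropic Gaussian cloud versus the same cloud
# shifted away from the origin, which concentrates directions in a cone.
rng = np.random.default_rng(0)
isotropic = rng.standard_normal((500, 50))
shifted = isotropic + 5.0

print(average_cosine_similarity(isotropic), average_cosine_similarity(shifted))
print(partition_ratio(isotropic), partition_ratio(shifted))
```

Both functions react strongly to a simple translation of the data even though the shape of the cloud is unchanged, which gives some intuition for why measures of this kind can be hard to compare across models.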
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
