Gini Coefficient as a Unified Metric for Evaluating Many-versus-Many Similarity in Vector Spaces
Ben Fauber

TL;DR
This paper introduces the use of Gini coefficients as a unified metric to evaluate and select similar items in vector spaces, demonstrating broad applicability across image and text data and improving machine learning training sample selection.
Contribution
The paper proposes a novel application of Gini coefficients for evaluating similarity and selecting training samples, showing its effectiveness across multiple data types and outperforming random sampling.
Findings
High Gini coefficients correlate with higher similarity among images.
Gini-based sample selection improves machine learning performance.
The method is effective across image and text vector representations.
Abstract
We demonstrate that Gini coefficients can be used as unified metrics to evaluate many-versus-many (all-to-all) similarity in vector spaces. Our analysis of various image datasets shows that images with the highest Gini coefficients tend to be the most similar to one another, while images with the lowest Gini coefficients are the least similar. We also show that this relationship holds true for vectorized text embeddings from various corpuses, highlighting the consistency of our method and its broad applicability across different types of data. Additionally, we demonstrate that selecting machine learning training samples that closely match the distribution of the testing dataset is far more important than ensuring data diversity. Selection of exemplary and iconic training samples with higher Gini coefficients leads to significantly better model performance compared to simply having a…
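As a rough illustration of the quantity the abstract refers to, the sketch below computes the standard Gini coefficient of a non-negative vector, then applies it to each row of an all-to-all similarity matrix. The row-wise use on cosine similarities, the clipping of negative similarities to zero, and the function names `gini` and `rowwise_gini` are assumptions made for this example; the summary above does not state how the paper maps vectors to Gini coefficients.

```python
import numpy as np


def gini(values):
    """Standard Gini coefficient of a 1-D array of non-negative values.

    Returns 0.0 for a perfectly even distribution and approaches 1.0 as
    the total concentrates in a single entry.
    """
    x = np.sort(np.asarray(values, dtype=float))  # ascending order
    n = x.size
    total = x.sum()
    if n == 0 or total == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with x sorted ascending
    return float(2.0 * np.sum(ranks * x) / (n * total) - (n + 1) / n)


def rowwise_gini(embeddings):
    """Hypothetical helper: one Gini value per item, taken over that item's
    row of the all-to-all cosine-similarity matrix.

    Assumptions (not from the source): rows are non-zero vectors, and
    negative similarities are clipped to zero so the Gini is well defined.
    """
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    sim = np.clip(X @ X.T, 0.0, None)                 # all-to-all similarity
    return np.array([gini(row) for row in sim])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rowwise_gini(rng.normal(size=(5, 8)))
    print(scores)  # one value per item, each between 0 and 1
```

With this definition, `gini([1, 1, 1, 1])` is 0 and `gini([0, 0, 0, 1])` is 0.75. Items could then be ranked by their score, for example with `np.argsort(scores)[::-1]`, though whether that ranking matches the paper's selection procedure is not confirmed by this summary.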
Taxonomy
Topics: Cognitive Science and Mapping · Neural Networks and Applications
Methods: Sparse Evolutionary Training
