TL;DR
This paper introduces similarity encoding, a method for representing high-cardinality, 'dirty' categorical variables by leveraging similarities between categories, leading to improved prediction performance on real-world, non-curated datasets.
Contribution
The paper proposes similarity encoding, a generalization of one-hot encoding that captures similarities between categories, and demonstrates significant empirical gains over traditional encoding techniques on seven real-world datasets.
Findings
Similarity encoding outperforms one-hot encoding and bag-of-character-n-grams encoding.
3-gram similarity effectively captures morphological resemblance.
Dimensionality reduction lowers computational cost with little loss in prediction performance.
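To make the idea concrete, here is a minimal sketch of 3-gram similarity encoding in plain Python. It assumes the similarity between two strings is the Jaccard ratio of their character 3-gram sets, and that each value is encoded by its similarity to a fixed list of reference categories. The function names, the lowercasing, and the space padding are illustrative choices, not the authors' implementation.

```python
def char_ngrams(s, n=3):
    """Return the set of character n-grams of s (lowercased, space-padded)."""
    # Padding with one space on each side is an assumption of this sketch;
    # it lets word boundaries contribute n-grams of their own.
    padded = f" {s.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard ratio between the n-gram sets of a and b, in [0, 1]."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    union = ga | gb
    if not union:
        # Both strings are empty: treat them as identical.
        return 1.0
    return len(ga & gb) / len(union)


def similarity_encode(values, prototypes, n=3):
    """Encode each value as its vector of similarities to the prototypes."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


if __name__ == "__main__":
    prototypes = ["police officer", "teacher", "fire fighter"]
    dirty_values = ["Police Officer", "police oficer", "teachr"]
    for value, row in zip(dirty_values, similarity_encode(dirty_values, prototypes)):
        print(value, [round(x, 2) for x in row])
```

Each encoded row has one coordinate per prototype. A value that matches a prototype exactly gets 1.0 in that coordinate, so when distinct categories share no n-grams the encoding coincides with one-hot encoding; misspelled or variant entries instead receive intermediate values that expose their resemblance to the learner.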
Abstract
For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately to feature vectors, e.g., with one-hot encoding. "Dirty" non-curated data gives rise to categorical variables with a very high cardinality but redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in prediction in comparison with known encoding…
