Entity Embeddings of Categorical Variables
Cheng Guo, Felix Berkhahn

TL;DR
This paper introduces entity embeddings for categorical variables, learned by a neural network during supervised training. They improve model performance, reduce memory usage relative to one-hot encoding, and reveal intrinsic properties of the categories, and they are especially useful for sparse datasets with high-cardinality features.
Contribution
It presents a novel method for mapping categorical variables into Euclidean spaces during neural network training, improving generalization and interpretability compared with one-hot encoding.
Findings
Entity embeddings improve neural network performance on categorical data.
Embeddings help visualize and cluster categorical variables.
The method reduces overfitting on datasets with high-cardinality features.
Abstract
We map categorical variables in a function approximation problem into Euclidean spaces, which are the entity embeddings of the categorical variables. The mapping is learned by a neural network during the standard supervised training process. Entity embedding not only reduces memory usage and speeds up neural networks compared with one-hot encoding, but more importantly, by mapping similar values close to each other in the embedding space, it reveals the intrinsic properties of the categorical variables. We applied it successfully in a recent Kaggle competition and were able to reach the third position with relatively simple features. We further demonstrate in this paper that entity embedding helps the neural network to generalize better when the data is sparse and statistics are unknown. Thus it is especially useful for datasets with lots of high cardinality features, where other methods…
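The mapping described in the abstract can be illustrated with a minimal sketch. It assumes the usual reading of an embedding layer: each category index selects one row of a weight matrix, which is the same as multiplying the one-hot vector by that matrix. The sizes, the category index `store_id`, and the helper names are hypothetical, and the weights are random placeholders; in real training they would be updated by backpropagation together with the rest of the network.

```python
import random

def one_hot(index, cardinality):
    """Return a one-hot list of length `cardinality` with 1.0 at `index`."""
    vec = [0.0] * cardinality
    vec[index] = 1.0
    return vec

def embed_by_lookup(index, weights):
    """Embedding as a table lookup: select row `index` of the weight matrix."""
    return list(weights[index])

def embed_by_matmul(index, weights):
    """Same embedding computed as one_hot(index) @ weights (linear layer, no bias)."""
    cardinality, dim = len(weights), len(weights[0])
    x = one_hot(index, cardinality)
    return [sum(x[i] * weights[i][j] for i in range(cardinality)) for j in range(dim)]

# Hypothetical sizes: 1000 distinct category values embedded in 10 dimensions.
CARDINALITY, EMBED_DIM = 1000, 10

# Random placeholder weights; a trained network would learn these values.
rng = random.Random(0)
weights = [[rng.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
           for _ in range(CARDINALITY)]

store_id = 42  # hypothetical category index
lookup = embed_by_lookup(store_id, weights)
matmul = embed_by_matmul(store_id, weights)

# Both routes give the same vector; the lookup never builds the one-hot input.
same = all(abs(a - b) < 1e-12 for a, b in zip(lookup, matmul))
print(same, len(one_hot(store_id, CARDINALITY)), len(lookup))
```

In this sketch, the layers after the embedding receive a vector of length `EMBED_DIM` rather than one of length `CARDINALITY`, which is one way to picture the memory and speed benefit the abstract mentions.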
Taxonomy
Topics: Bayesian Modeling and Causal Inference · Topic Modeling · Advanced Text Analysis Techniques
