
TL;DR
This paper introduces additive feature hashing, a method for encoding categorical features that relies on the "almost orthogonal" property of high-dimensional random vectors. Its performance is similar to that of the traditional hashing trick on synthetic, language recognition, and SMS spam detection data.
Contribution
The paper presents a novel additive hashing approach based on high-dimensional vectors, providing an alternative to the traditional hashing trick with comparable effectiveness.
Findings
Additive feature hashing performs similarly to the traditional hashing trick.
The results are illustrated numerically on synthetic, language recognition, and SMS spam detection data.
The additive approach relies on the "almost orthogonal" property of high-dimensional random vectors.
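The "almost orthogonal" property can be checked numerically. The sketch below is an illustration only and is not taken from the paper: the helper names (`random_sign_vector`, `cosine`), the use of ±1 entries, and the dimension of 10,000 are all assumptions chosen for the demonstration.

```python
import math
import random

def random_sign_vector(dim, rng):
    """Draw a vector with independent entries in {-1, +1} (an assumed choice)."""
    return [rng.choice((-1, 1)) for _ in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

rng = random.Random(0)   # fixed seed so the run is reproducible
dim = 10_000             # hypothetical dimension for the illustration

a = random_sign_vector(dim, rng)
b = random_sign_vector(dim, rng)

# A vector is perfectly aligned with itself ...
self_similarity = cosine(a, a)
# ... while two independent random vectors are nearly orthogonal:
# the cosine has standard deviation of about 1 / sqrt(dim).
cross_similarity = cosine(a, b)

print(self_similarity, cross_similarity)
```

With independent ±1 entries the cosine between two distinct vectors concentrates around zero at a rate of roughly 1/sqrt(dim), which is why sums of such vectors can remain distinguishable in high dimensions.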
Abstract
The hashing trick is a machine learning technique used to encode categorical features into a numerical vector representation of pre-defined fixed length. It works by using the categorical hash values as vector indices, and updating the vector values at those indices. Here we discuss a different approach based on additive-hashing and the "almost orthogonal" property of high-dimensional random vectors. That is, we show that additive feature hashing can be performed directly by adding the hash values and converting them into high-dimensional numerical vectors. We show that the performance of additive feature hashing is similar to the hashing trick, and we illustrate the results numerically using synthetic, language recognition, and SMS spam detection data.
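The two encodings described in the abstract can be contrasted with a minimal sketch. The first function follows the hashing trick as stated above (hash values used as indices, values updated at those indices). The second is only one possible reading of "adding the hash values and converting them into high-dimensional numerical vectors" and should not be taken as the paper's procedure: the choice of hash functions, the ±1 bit mapping, the dimensions, and every function name here are assumptions.

```python
import hashlib

def stable_hash(token):
    """Deterministic integer hash of a string (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def hashing_trick(tokens, dim=16):
    """Hashing trick: the hash picks an index, and that entry is incremented."""
    vec = [0] * dim
    for tok in tokens:
        vec[stable_hash(tok) % dim] += 1
    return vec

def sign_vector(token, dim=4096):
    """Hypothetical step: expand a token's hash into dim bits, mapped to -1/+1."""
    raw = hashlib.shake_256(token.encode("utf-8")).digest((dim + 7) // 8)
    bits = int.from_bytes(raw, "big")
    return [1 if (bits >> i) & 1 else -1 for i in range(dim)]

def additive_encoding(tokens, dim=4096):
    """Hypothetical additive variant: sum the per-token sign vectors."""
    total = [0] * dim
    for tok in tokens:
        for i, s in enumerate(sign_vector(tok, dim)):
            total[i] += s
    return total

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

features = ["color=red", "size=large", "color=red"]
indexed = hashing_trick(features, dim=16)

summed = additive_encoding(["color=red", "size=large"])
member_score = dot(summed, sign_vector("color=red"))    # token that was added
absent_score = dot(summed, sign_vector("shape=round"))  # token that was not added
print(indexed, member_score, absent_score)
```

In this sketch a token that was added to the sum produces a dot product close to the dimension, while an absent token produces a value near zero, which is the near-orthogonality argument at work. Practical hashing-trick implementations often also use a second hash to choose the sign of each update; that refinement is omitted here.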
Taxonomy
Topics: Algorithms and Data Compression · Advanced Image and Video Retrieval Techniques · Spam and Phishing Detection
