Probabilistic Hash Embeddings for Online Learning of Categorical Features
Aodong Li, Abishek Sankararaman, Balakrishnan Narayanaswamy

TL;DR
This paper introduces a probabilistic hash embedding model for online learning of categorical features, addressing issues of order sensitivity and forgetting in deterministic embeddings, and demonstrating superior performance and memory efficiency.
Contribution
The paper proposes a Bayesian online learning approach for hash embeddings that handles evolving vocabularies and maintains performance without increasing memory.
Findings
Outperforms traditional methods in online classification and recommendation tasks.
Maintains high accuracy while using only 2-4 times the memory of one-hot embeddings.
Effectively handles growing and changing categorical vocabularies.
Abstract
We study streaming data with categorical features where the vocabulary of categorical feature values is changing and can even grow unboundedly over time. Feature hashing is commonly used as a pre-processing step to map these categorical values into a feature space of fixed size before learning their embeddings. While these methods have been developed and evaluated for offline or batch settings, in this paper we consider online settings. We show that deterministic embeddings are sensitive to the arrival order of categories and suffer from forgetting in online learning, leading to performance deterioration. To mitigate this issue, we propose a probabilistic hash embedding (PHE) model that treats hash embeddings as stochastic and applies Bayesian online learning to learn incrementally from data. Based on the structure of PHE, we derive a scalable inference algorithm to learn model…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsAdvanced Image and Video Retrieval Techniques · Data Stream Mining Techniques · Information Retrieval and Search Behavior
