Categorical Feature Compression via Submodular Optimization
MohammadHossein Bateni, Lin Chen, Hossein Esfandiari, Thomas Fu, Vahab S. Mirrokni, Afshin Rostamizadeh

TL;DR
This paper introduces a scalable, submodular optimization-based algorithm for compressing large categorical vocabularies, preserving mutual information with labels and enabling efficient distributed implementation.
Contribution
It presents a novel submodular formulation for categorical feature compression with provable approximation guarantees and scalable algorithms suitable for large-scale data.
Findings
Achieves near-optimal mutual information retention in large vocabularies.
Runs in $O(n \log n)$ time, where $n$ is the input vocabulary size, and admits a distributed implementation.
Demonstrates improved performance on the Criteo dataset.
Abstract
In the era of big data, learning from categorical features with very large vocabularies (e.g., 28 million for the Criteo click prediction dataset) has become a practical challenge for machine learning researchers and practitioners. We design a highly scalable vocabulary compression algorithm that seeks to maximize the mutual information between the compressed categorical feature and the target binary labels, and we furthermore show that its solution is guaranteed to be within a $1 - 1/e \approx 63\%$ factor of the global optimal solution. To achieve this, we introduce a novel re-parametrization of the mutual information objective, which we prove is submodular, and design a data structure to query the submodular function in amortized $O(\log n)$ time (where $n$ is the input vocabulary size). Our complete algorithm is shown to operate in $O(n \log n)$ time. Additionally, we design a…
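To make the objective in the abstract concrete, the following minimal Python sketch computes the empirical mutual information between a categorical feature and a binary label, before and after merging vocabulary values into buckets. It is an illustration only and not the paper's algorithm: the toy data, the function names `mutual_information` and `compress`, and the two hand-picked bucket mappings are all assumptions made for this example.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical I(X; Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        # Term p(x, y) * log2(p(x, y) / (p(x) * p(y))), written with raw counts.
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def compress(pairs, mapping):
    """Replace each feature value by its bucket id (hypothetical helper)."""
    return [(mapping[x], y) for x, y in pairs]


# Toy data (made up): four feature values with different positive-label rates.
pairs = (
    [("a", 1)] * 9 + [("a", 0)] * 1      # 90% positive
    + [("b", 1)] * 8 + [("b", 0)] * 2    # 80% positive
    + [("c", 1)] * 2 + [("c", 0)] * 8    # 20% positive
    + [("d", 1)] * 1 + [("d", 0)] * 9    # 10% positive
)

full = mutual_information(pairs)
# Merging values with similar label rates keeps most of the information.
good = mutual_information(compress(pairs, {"a": 0, "b": 0, "c": 1, "d": 1}))
# Merging values with dissimilar label rates destroys most of it.
bad = mutual_information(compress(pairs, {"a": 0, "c": 0, "b": 1, "d": 1}))

print(f"uncompressed: {full:.4f} bits")
print(f"similar-rate buckets: {good:.4f} bits")
print(f"mixed-rate buckets: {bad:.4f} bits")
```

Compression can never increase the mutual information with the label, so the question is which grouping of values into a fixed number of buckets loses the least. On this toy input the similar-rate grouping retains far more information than the mixed-rate one, which is the quantity the compression algorithm described above seeks to maximize.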
Taxonomy
Topics: Complexity and Algorithms in Graphs · Machine Learning and Algorithms · Advanced Graph Neural Networks
