Bloom Filter Encoding for Machine Learning
John Cartmell, Mihaela Cardei, and Ionut Cardei

TL;DR
This paper introduces a Bloom filter-based encoding method for data preprocessing in machine learning, offering memory efficiency and data obfuscation while maintaining model performance across diverse datasets.
Contribution
The authors propose a novel Bloom filter transform for data encoding that is simple, flexible, and effective across multiple data types and classifiers.
Findings
Models trained on Bloom filter encodings perform comparably to models trained on raw data.
The method reduces memory usage across datasets.
The encoding obfuscates original feature values while preserving similarity structure between samples.
Abstract
We present a method that uses a Bloom filter transform to preprocess data for machine learning. Each sample is encoded into a compact bit-array representation using hash-based encoding, producing a fixed-length feature space that reduces memory usage and obfuscates original feature values. The encoding does not rely on keyed hashing; however, a key can optionally be used to control the mapping and would be required to reproduce the representation. We evaluate the approach on six datasets spanning text, time-series, tabular, and image domains: SMS Spam Collection, ECG200, Adult 50K, CDC Diabetes, MNIST, and Fashion MNIST. Four classifiers are considered: Extreme Gradient Boosting, Deep Neural Networks, Convolutional Neural Networks, and Logistic Regression. Results show that models trained on Bloom filter encodings achieve performance comparable to models trained on raw data or standard…
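The hash-based encoding described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`bloom_encode`, `jaccard`), the parameter values (`num_bits=256`, `num_hashes=3`), the choice of BLAKE2b as the hash, and the idea of representing a sample as a list of string tokens are all assumptions made for illustration. The optional `key` argument mirrors the abstract's remark that a key can control the mapping and would then be needed to reproduce the representation.

```python
import hashlib


def bloom_encode(tokens, num_bits=256, num_hashes=3, key=b""):
    """Encode an iterable of string tokens as a fixed-length list of 0/1 bits.

    Illustrative assumptions: BLAKE2b as the hash function, a per-hash salt
    to derive num_hashes distinct mappings, and an optional secret key.
    """
    bits = [0] * num_bits
    for token in tokens:
        data = token.encode("utf-8")
        for i in range(num_hashes):
            # A distinct salt per index acts as a distinct hash function.
            digest = hashlib.blake2b(
                data,
                digest_size=8,
                key=key,
                salt=i.to_bytes(2, "little"),
            ).digest()
            position = int.from_bytes(digest, "little") % num_bits
            bits[position] = 1
    return bits


def jaccard(a, b):
    """Jaccard similarity between two bit vectors of equal length."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0


# Hypothetical samples: text split into words, tabular rows as "name=value".
spam_a = bloom_encode("free prize call now".split())
spam_b = bloom_encode("free prize call today".split())
ham = bloom_encode("see you at lunch".split())
row = bloom_encode(["age=39", "education=Bachelors", "hours=40"])

print(len(spam_a), sum(spam_a))
print(jaccard(spam_a, spam_b), jaccard(spam_a, ham))
```

Every sample maps to the same `num_bits` positions regardless of how many tokens it contains, which is what gives the fixed-length feature space. Samples that share tokens set the same bits, so overlap between bit vectors tracks overlap between the original token sets, while the bit vector alone does not directly reveal the original values. Passing a non-empty `key` changes which bits are set, so the same key is needed to regenerate a given encoding.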
