
TL;DR
This paper presents Modular Linear Tokenization (MLT), a reversible, deterministic encoding method for high-cardinality categorical data. It preserves bijective mappings, scales to millions of identifiers, and requires fewer parameters than supervised embeddings.
Contribution
MLT introduces a reversible encoding technique that combines modular arithmetic over finite fields with invertible linear transformations, giving explicit control over dimensionality and a scalable, compact representation of categorical identifiers.
Findings
On the MovieLens 20M dataset, MLT achieves predictive performance comparable to supervised embeddings.
MLT requires significantly fewer parameters and has a lower training cost than supervised embeddings.
MLT maintains full reversibility for millions of identifiers.
Abstract
This paper introduces Modular Linear Tokenization (MLT), a reversible and deterministic technique for encoding high-cardinality categorical identifiers into compact numerical vectors. Unlike traditional hashing or one-hot encodings, MLT preserves bijective mappings by leveraging modular arithmetic over finite fields and invertible linear transformations. The method offers explicit control of dimensionality and computational scalability while maintaining full reversibility, even for millions of identifiers. Experimental results on the MovieLens 20M dataset show that MLT achieves comparable predictive performance to supervised embeddings while requiring significantly fewer parameters and lower training cost. An open-source implementation of MLT is available on PyPI (https://pypi.org/project/light-mlt/) and GitHub (https://github.com/tcharliesschmitz/light-mlt).
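The abstract names two ingredients, modular arithmetic over a finite field and an invertible linear transformation. The following is a minimal sketch of how such a reversible encoding could work, not the authors' implementation and not the `light-mlt` API. The prime `P`, the dimension `N`, the matrix `A`, and every function name are hypothetical choices made for illustration: an integer identifier is split into `N` base-`P` digits, the digit vector is multiplied by `A` modulo `P`, and decoding applies the inverse of `A` modulo `P`.

```python
# Illustrative sketch only: P, N, A and all function names are assumptions,
# not taken from the paper or from the light-mlt package.
P = 257  # prime modulus, so arithmetic mod P forms a finite field
N = 3    # vector dimension; capacity is P**N = 16,974,593 identifiers

# Any matrix with a nonzero determinant mod P is invertible (det(A) = 48 here).
A = [[2, 3, 1],
     [1, 1, 4],
     [5, 0, 7]]


def to_digits(x, p=P, n=N):
    """Split an integer into n base-p digits, least significant first."""
    if not 0 <= x < p ** n:
        raise ValueError("identifier out of range for this (p, n)")
    digits = []
    for _ in range(n):
        x, r = divmod(x, p)
        digits.append(r)
    return digits


def from_digits(digits, p=P):
    """Inverse of to_digits."""
    x = 0
    for d in reversed(digits):
        x = x * p + d
    return x


def mat_vec(M, v, p=P):
    """Matrix-vector product modulo p."""
    return [sum(a * b for a, b in zip(row, v)) % p for row in M]


def mat_inv_mod(M, p=P):
    """Invert a square matrix over GF(p) by Gauss-Jordan elimination."""
    n = len(M)
    # Augment M with the identity matrix.
    aug = [[a % p for a in row] + [int(i == j) for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular modulo p")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, p)  # modular inverse (Python 3.8+)
        aug[col] = [(x * inv) % p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [(x - f * y) % p for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


A_INV = mat_inv_mod(A)


def encode(x):
    """Map an integer identifier to a length-N vector with entries in [0, P)."""
    return mat_vec(A, to_digits(x))


def decode(v):
    """Recover the original identifier from its encoded vector."""
    return from_digits(mat_vec(A_INV, v))


if __name__ == "__main__":
    for item_id in (0, 42, 123456, P ** N - 1):
        assert decode(encode(item_id)) == item_id
```

Because both the digit expansion and the multiplication by an invertible matrix are bijections, their composition is a bijection on the `P**N` possible identifiers. In this sketch, raising `N` or `P` increases capacity, which is one way to read the abstract's "explicit control of dimensionality".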
