Quasi-orthonormal Encoding for Machine Learning Applications
Haw-minn Lu

TL;DR
This paper introduces Quasi-orthonormal encoding (QOE), a new method for encoding categorical data in machine learning that addresses limitations of existing schemes like one-hot encoding, especially for high-cardinality categories.
Contribution
The paper proposes QOE as a novel encoding scheme suitable for high-cardinality categorical data, with implementation examples and a demonstration on MNIST.
Findings
QOE reduces dimensionality compared to one-hot encoding.
QOE is compatible with popular ML libraries like TensorFlow and PyTorch.
QOE performs effectively on the MNIST handwritten-digit dataset.
Abstract
Most machine learning models, especially artificial neural networks, require numerical, not categorical, data. We briefly describe the advantages and disadvantages of common encoding schemes. For example, one-hot encoding is commonly used for attributes with a few unrelated categories, and word embeddings for attributes with many related categories (e.g., words). Neither is suitable for encoding attributes with many unrelated categories, such as diagnosis codes in healthcare applications. Applying one-hot encoding to diagnosis codes, for example, can produce extremely high-dimensional, low-sample-size problems or artificially induce machine learning artifacts, not to mention the explosion of computing resources needed. Quasi-orthonormal encoding (QOE) fills the gap. We briefly show how QOE compares to one-hot encoding. We provide example code of how to implement QOE using…
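The dimensionality trade-off described above can be illustrated with a small sketch. It assumes that "quasi-orthonormal" means unit-length code vectors whose pairwise dot products are small but not necessarily zero. The random-unit-vector codebook below is only a stand-in for illustration, not the construction used in the paper, and all helper names (`random_unit_codebook`, `max_abs_dot`, `decode`) are hypothetical.

```python
import math
import random


def one_hot(index, num_categories):
    """Standard one-hot vector: its width equals the number of categories."""
    return [1.0 if i == index else 0.0 for i in range(num_categories)]


def random_unit_codebook(num_categories, dim, seed=0):
    """Hypothetical helper: one random unit vector per category.

    Random directions in a moderately high-dimensional space tend to be
    nearly orthogonal. This is an illustrative stand-in, NOT the paper's
    construction.
    """
    rng = random.Random(seed)
    codebook = []
    for _ in range(num_categories):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        codebook.append([x / norm for x in v])
    return codebook


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def max_abs_dot(codebook):
    """Largest |dot product| between two distinct code vectors."""
    worst = 0.0
    for i in range(len(codebook)):
        for j in range(i + 1, len(codebook)):
            worst = max(worst, abs(dot(codebook[i], codebook[j])))
    return worst


def decode(vector, codebook):
    """Map a vector back to the category whose code it aligns with best."""
    scores = [dot(vector, code) for code in codebook]
    return scores.index(max(scores))


num_categories, dim = 200, 64
codebook = random_unit_codebook(num_categories, dim)

print(len(one_hot(3, num_categories)))  # width of the one-hot encoding
print(len(codebook[3]))                 # width of the reduced encoding
print(max_abs_dot(codebook))            # 0.0 would mean exactly orthogonal
print(decode(codebook[7], codebook))    # recovers the category index
```

In this sketch, 200 categories fit in 64 dimensions at the cost of nonzero overlap between code vectors. How much overlap is tolerable, and how the vectors should actually be chosen, is what a real quasi-orthonormal scheme has to specify.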
Taxonomy
Topics: Machine Learning and Data Classification · Machine Learning and Algorithms · Imbalanced Data Classification Techniques
