Quasi-orthonormal Encoding for Machine Learning Applications

Haw-minn Lu

arXiv:2006.00038·cs.LG·June 2, 2020

TL;DR

This paper introduces Quasi-orthonormal encoding (QOE), a new method for encoding categorical data in machine learning that addresses limitations of existing schemes like one-hot encoding, especially for high-cardinality categories.

Contribution

The paper proposes QOE as a novel encoding scheme suitable for high-cardinality categorical data, with implementation examples and a demonstration on MNIST.

Findings

01 QOE reduces dimensionality compared to one-hot encoding.

02 QOE is compatible with popular ML libraries like TensorFlow and PyTorch.

03 QOE performs effectively on MNIST handwriting data.

Abstract

Most machine learning models, especially artificial neural networks, require numerical, not categorical data. We briefly describe the advantages and disadvantages of common encoding schemes. For example, one-hot encoding is commonly used for attributes with a few unrelated categories and word embeddings for attributes with many related categories (e.g., words). Neither is suitable for encoding attributes with many unrelated categories, such as diagnosis codes in healthcare applications. Application of one-hot encoding for diagnosis codes, for example, can result in extremely high dimensionality with low sample size problems or artificially induce machine learning artifacts, not to mention the explosion of computing resources needed. Quasi-orthonormal encoding (QOE) fills the gap. We briefly show how QOE compares to one-hot encoding. We provide example code of how to implement QOE using…
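The dimensionality trade-off described in the abstract can be illustrated with a small sketch. It assumes that "quasi-orthonormal" means unit-length code vectors whose pairwise dot products are small but not necessarily zero. The helper `random_unit_codes` is hypothetical: it draws random Gaussian directions as a stand-in and is not the construction used in the paper. The values `num_categories = 200` and `dim = 64` are arbitrary choices for illustration.

```python
import math
import random


def one_hot_codes(num_categories):
    """Return the standard one-hot code table: one dimension per category."""
    return [[1.0 if i == j else 0.0 for j in range(num_categories)]
            for i in range(num_categories)]


def random_unit_codes(num_categories, dim, seed=0):
    """Hypothetical stand-in for a quasi-orthonormal code table.

    Draws Gaussian vectors and scales each to unit length. This is an
    assumption made for illustration, not the construction from the paper.
    """
    rng = random.Random(seed)
    codes = []
    for _ in range(num_categories):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        codes.append([x / norm for x in v])
    return codes


def max_abs_cross_dot(codes):
    """Largest |dot product| between two distinct code vectors."""
    worst = 0.0
    for i in range(len(codes)):
        for j in range(i + 1, len(codes)):
            dot = sum(a * b for a, b in zip(codes[i], codes[j]))
            worst = max(worst, abs(dot))
    return worst


num_categories = 200  # arbitrary example size
dim = 64              # arbitrary reduced dimension, smaller than num_categories

one_hot = one_hot_codes(num_categories)
quasi = random_unit_codes(num_categories, dim)

# One-hot: dimension equals the number of categories, codes exactly orthogonal.
print(len(one_hot[0]), max_abs_cross_dot(one_hot))
# Stand-in codes: fewer dimensions, at the cost of nonzero overlap between codes.
print(len(quasi[0]), max_abs_cross_dot(quasi))
```

The second printed number shows how far the reduced-dimension codes are from exact orthogonality. Random directions are only one way to obtain such vectors, and a deliberate construction could plausibly achieve a tighter bound for the same dimension.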

Code & Models

Repositories

Westhealth/scipy2020 (PyTorch, official)

Taxonomy

Topics: Machine Learning and Data Classification · Machine Learning and Algorithms · Imbalanced Data Classification Techniques