Tabular Learning: Encoding for Entity and Context Embeddings
Fredy Reusser

TL;DR
This paper evaluates various encoding techniques for entity and context embeddings in tabular learning, demonstrating that string similarity-based encoding outperforms ordinal encoding, especially with transformer architectures in multi-label classification.
Contribution
It introduces a benchmark comparing encoding methods for categorical data, highlighting the superiority of string similarity encoding over ordinal encoding in tabular learning tasks.
Findings
String similarity encoding improves classification accuracy relative to ordinal encoding.
Transformers perform better with similarity-based encodings.
Ordinal encoding is less effective than string similarity encoding for categorical data.
Abstract
This work examines the effect of different encoding techniques on entity and context embeddings, with the goal of challenging the commonly used ordinal encoding for tabular learning. Applying different preprocessing methods and network architectures over several datasets resulted in a benchmark of how the encoders influence the learning outcome of the networks. With the test, validation and training data kept consistent, the results show that ordinal encoding is not the best-suited encoder for categorical data, either for preprocessing the data or for subsequently classifying the target variable correctly. A better outcome was achieved by encoding the features based on string similarities, computing a similarity matrix as input for the network. This is the case for both entity and context embeddings, where the transformer architecture showed improved performance for Ordinal and Similarity…
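To make the contrast in the abstract concrete, the following minimal sketch encodes one categorical column both ways. It is an illustration under stated assumptions, not the paper's implementation: the similarity measure (`difflib.SequenceMatcher.ratio`), the helper names `ordinal_encode` and `similarity_encode`, and the example column are all hypothetical choices, since the abstract does not say which string similarity metric was used.

```python
from difflib import SequenceMatcher


def ordinal_encode(values):
    """Map each distinct category to an integer index (alphabetical order)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [index[v] for v in values], categories


def similarity_encode(values):
    """Encode each value as its row in a pairwise string-similarity matrix.

    Assumption: SequenceMatcher.ratio() stands in for whatever similarity
    measure the paper actually uses.
    """
    categories = sorted(set(values))
    matrix = {
        a: [SequenceMatcher(None, a, b).ratio() for b in categories]
        for a in categories
    }
    return [matrix[v] for v in values], categories


# Hypothetical categorical column.
column = ["engineer", "engineering", "nurse", "engineer"]

ordinal, cats = ordinal_encode(column)
similarity, _ = similarity_encode(column)

print(cats)
print(ordinal)
for value, row in zip(column, similarity):
    print(value, [round(x, 2) for x in row])
```

The ordinal codes impose an arbitrary order on the categories, whereas each similarity row places "engineer" closer to "engineering" than to "nurse", which is the kind of structure a network can exploit.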
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Video Analysis and Summarization
