From Strings to Data Science: a Practical Framework for Automated String Handling
John W. van Lith, Joaquin Vanschoren

TL;DR
This paper introduces a practical framework for automatically converting categorical string features into numerical data for machine learning, leveraging best practices, domain knowledge, and novel techniques, with an open-source Python implementation.
Contribution
It presents a new framework that automatically identifies and processes different string feature types, improving preprocessing for machine learning models.
Findings
Effective automatic identification of string feature types
Successful encoding of string features into numerical data
Open-source Python tool demonstrated on diverse datasets
Abstract
Many machine learning libraries require that string features be converted to a numerical representation for the models to work as intended. Categorical string features can represent a wide variety of data (e.g., zip codes, names, marital status), and are notoriously difficult to preprocess automatically. In this paper, we propose a framework to do so based on best practices, domain knowledge, and novel techniques. It automatically identifies different types of string features, processes them accordingly, and encodes them into numerical representations. We also provide an open source Python implementation to automatically preprocess categorical string data in tabular datasets and demonstrate promising results on a wide range of datasets.
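The identify-then-encode pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the three coarse types, and the `max_categories` threshold are all assumptions made for illustration, and the heuristics are deliberately naive (for example, zip codes would be treated as numeric here even though they are categorical).

```python
import re
from collections import Counter


def infer_string_type(values, max_categories=10):
    """Guess a coarse type for a column of strings (hypothetical heuristics)."""
    present = [v.strip() for v in values if v is not None and v.strip()]
    # Numbers stored as text, e.g. "3.5" or "-12".
    if present and all(re.fullmatch(r"[+-]?\d+(\.\d+)?", v) for v in present):
        return "numeric"
    # Few distinct values: treat as a low-cardinality categorical feature.
    if len(set(present)) <= max_categories:
        return "categorical"
    return "high_cardinality"


def one_hot(values):
    """One indicator column per category; missing values map to all zeros."""
    categories = sorted({v for v in values if v is not None})
    return [[1 if v == c else 0 for c in categories] for v in values]


def frequency_encode(values):
    """Replace each value by its relative frequency in the column."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in values]


def encode_column(values):
    """Pick an encoding based on the inferred type (assumed mapping)."""
    kind = infer_string_type(values)
    if kind == "numeric":
        nan = float("nan")
        return kind, [float(v) if v is not None and v.strip() else nan for v in values]
    if kind == "categorical":
        return kind, one_hot(values)
    return kind, frequency_encode(values)
```

For instance, `encode_column(["single", "married", "single"])` classifies the column as categorical and one-hot encodes it, while a column of many distinct names falls back to frequency encoding.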
Taxonomy
Topics: Data Quality and Management · Topic Modeling · Natural Language Processing Techniques
