From Strings to Data Science: a Practical Framework for Automated String Handling

John W. van Lith; Joaquin Vanschoren

arXiv:2111.01868·cs.LG·November 5, 2021·1 cites

Open Access

TL;DR

This paper introduces a practical framework for automatically converting categorical string features into numerical data for machine learning, leveraging best practices, domain knowledge, and novel techniques, with an open-source Python implementation.

Contribution

It presents a new framework that automatically identifies and processes different string feature types, improving preprocessing for machine learning models.

Findings

01. Effective automatic identification of string feature types
02. Successful encoding of string features into numerical data
03. Open-source Python tool demonstrated on diverse datasets

Abstract

Many machine learning libraries require that string features be converted to a numerical representation for the models to work as intended. Categorical string features can represent a wide variety of data (e.g., zip codes, names, marital status), and are notoriously difficult to preprocess automatically. In this paper, we propose a framework to do so based on best practices, domain knowledge, and novel techniques. It automatically identifies different types of string features, processes them accordingly, and encodes them into numerical representations. We also provide an open source Python implementation to automatically preprocess categorical string data in tabular datasets and demonstrate promising results on a wide range of datasets.
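The identify-then-encode flow described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the type names, the `max_categories` threshold, the helper names `infer_string_type` and `encode_column`, and the choice of one-hot and frequency encodings are all assumptions made for illustration, and the sketch assumes a column with no missing values.

```python
import re
from collections import Counter

# Hypothetical pattern for strings that are really plain numbers.
NUMERIC_RE = re.compile(r"^-?\d+(\.\d+)?$")


def infer_string_type(values, max_categories=10):
    """Guess a coarse type for a column of strings (illustrative heuristic only)."""
    cleaned = [v.strip() for v in values]
    if cleaned and all(NUMERIC_RE.match(v) for v in cleaned):
        return "numeric"
    if len(set(cleaned)) <= max_categories:
        return "low_cardinality"
    return "high_cardinality"


def encode_column(values, max_categories=10):
    """Return (inferred_type, rows), where each row is a list of floats."""
    cleaned = [v.strip() for v in values]
    kind = infer_string_type(cleaned, max_categories)
    if kind == "numeric":
        # Numbers stored as strings are simply parsed.
        return kind, [[float(v)] for v in cleaned]
    if kind == "low_cardinality":
        # One-hot encoding over the sorted set of observed categories.
        categories = sorted(set(cleaned))
        rows = [[1.0 if v == c else 0.0 for c in categories] for v in cleaned]
        return kind, rows
    # Assumed fallback for many distinct values: relative frequency of each value.
    counts = Counter(cleaned)
    total = len(cleaned)
    return kind, [[counts[v] / total] for v in cleaned]
```

Calling `encode_column(["single", "married", "single"])` would classify the column as low cardinality and one-hot encode it over the categories `married` and `single`. A real framework would need many more type checks (for example dates, zip codes, or names) and a missing-value strategy, which this sketch deliberately omits.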

Taxonomy

Topics: Data Quality and Management · Topic Modeling · Natural Language Processing Techniques