Parsed Categoric Encodings with Automunge
Nicholas J. Teague

TL;DR
Automunge is an open-source Python library that automates feature engineering for tabular data, including encoding categorical strings by extracting structure and overlaps to improve machine learning readiness.
Contribution
The paper introduces novel automated string parsing methods within Automunge to extract structure from categorical string sets, enhancing encoding without human intervention.
Findings
Automunge effectively extracts structure from categorical strings.
Encoding methods improve machine learning data preparation.
Automated parsing reduces manual feature engineering effort.
Abstract
Automunge is an open-source Python library for tabular data pre-processing. It automates feature engineering transformations, namely numerical encoding and missing-data infill, applied to received tidy data. The transformations are fit to the properties of columns in a designated train set, so they can be applied consistently and efficiently in subsequent data pipelines such as inference. Transformations may be applied to distinct columns in "family tree" sets with generations and branches of derivations. The library includes methods that extract structure from bounded categorical string sets by automated string parsing: entries in the set of unique values are compared to identify character subset overlaps, which may be encoded either by appending columns of boolean overlap-detection activations or by replacing string entries with the identified overlap partitions.…
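The overlap parsing described in the abstract can be illustrated with a small standalone sketch. This is not Automunge's API or its actual algorithm: the function names (`fit_overlaps`, `encode_overlaps`, `replace_with_overlap`), the `min_len` threshold, and the choice of the longest common substring per pair of entries are all assumptions made for illustration. The sketch fits the overlaps on a train set's unique values, then applies the same basis to later data, mirroring the fit-then-apply pattern the abstract describes.

```python
from difflib import SequenceMatcher
from itertools import combinations


def fit_overlaps(train_values, min_len=3):
    """Collect substrings shared between pairs of unique train entries.

    Hypothetical helper: for each pair of unique values, keep the longest
    common substring if it has at least `min_len` characters.
    """
    uniques = sorted(set(train_values))
    overlaps = set()
    for a, b in combinations(uniques, 2):
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        m = matcher.find_longest_match(0, len(a), 0, len(b))
        if m.size >= min_len:
            overlaps.add(a[m.a:m.a + m.size])
    # Longest overlaps first, ties broken alphabetically, for a stable order.
    return sorted(overlaps, key=lambda s: (-len(s), s))


def encode_overlaps(values, overlaps):
    """Boolean activations: one entry per overlap, True where it occurs."""
    return [{o: (o in v) for o in overlaps} for v in values]


def replace_with_overlap(value, overlaps):
    """Replace an entry with the first overlap it contains, else keep it."""
    for o in overlaps:
        if o in value:
            return o
    return value


# Fit on the train set, then apply the same basis to unseen entries.
train = ["redcar", "redbus", "bluecar", "bluebus"]
overlaps = fit_overlaps(train)
activations = encode_overlaps(["redcar", "greenbus"], overlaps)
```

Because the overlaps are derived only from the train set, an unseen entry such as "greenbus" is encoded against the same columns as the training data, which keeps the encoding consistent at inference time.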
Taxonomy
Topics: Natural Language Processing Techniques · Rough Sets and Fuzzy Logic · Machine Learning and Data Classification
