ptype-cat: Inferring the Type and Values of Categorical Variables
Taha Ceritli, Christopher K. I. Williams

TL;DR
This paper introduces ptype-cat, a probabilistic method for inferring the type and possible values of categorical variables, including non-Boolean categories, improving automatic data annotation.
Contribution
The paper presents ptype-cat, a novel approach that extends existing type inference to accurately identify and extract non-Boolean categorical variables and their values.
Findings
Identifies categorical columns, including non-Boolean ones, more accurately than existing type inference methods in the paper's empirical evaluation.
Detects the possible values of each categorical variable.
Abstract
Type inference is the task of identifying the type of values in a data column and has been studied extensively in the literature. Most existing type inference methods support data types such as Boolean, date, float, integer and string. However, these methods do not consider non-Boolean categorical variables, where there are more than two possible values encoded by integers or strings. Therefore, such columns are annotated either as integer or string rather than categorical, and need to be transformed into categorical manually by the user. In this paper, we propose a probabilistic type inference method that can identify the general categorical data type (including non-Boolean variables). Additionally, we identify the possible values of each categorical variable by adapting the existing type inference method ptype. Combining these methods, we present ptype-cat which achieves better…
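To make the problem in the abstract concrete, here is a minimal sketch of a naive cardinality heuristic for flagging categorical columns and listing their values. It is not the ptype-cat model, which is probabilistic; the function name `naive_infer_column` and the thresholds `max_unique` and `max_unique_ratio` are hypothetical choices made for illustration, and the sketch assumes column values arrive as strings, as they would when read from a CSV file.

```python
from collections import Counter


def naive_infer_column(values, max_unique=10, max_unique_ratio=0.2):
    """Naive baseline heuristic (NOT the ptype-cat model).

    Labels a column "categorical" when it has few distinct values, both in
    absolute terms and relative to the number of non-missing entries.
    Otherwise falls back to "integer" or "string".
    Returns a (type_label, possible_values) pair.
    """
    # Treat empty strings and None as missing entries (an assumption).
    non_missing = [v for v in values if v not in ("", None)]
    if not non_missing:
        return "unknown", []

    counts = Counter(non_missing)
    n_unique = len(counts)
    if n_unique <= max_unique and n_unique / len(non_missing) <= max_unique_ratio:
        # The distinct observed entries serve as the candidate category values.
        return "categorical", sorted(counts)

    # Fallback: integer if every entry parses as an int, else string.
    try:
        for v in non_missing:
            int(v)
        return "integer", []
    except (ValueError, TypeError):
        return "string", []


# An integer-coded rating column: a conventional type inference method would
# annotate this as integer, although it is better treated as categorical.
ratings = ["1", "2", "3", "4"] * 10
ids = [str(i) for i in range(100)]
names = ["user%d" % i for i in range(50)]

for column in (ratings, ids, names):
    print(naive_infer_column(column))
```

A fixed threshold like this misfires easily, for example on short columns or on identifiers that happen to repeat, which is part of the motivation for a probabilistic treatment such as the one the paper proposes.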
Taxonomy
Topics: Data Quality and Management · Data Mining Algorithms and Applications · Advanced Database Systems and Queries
