Learning Interesting Categorical Attributes for Refined Data Exploration
Koninika Pal, Sebastian Michel

TL;DR
This paper introduces a machine learning approach to identify interesting categorical attributes for data exploration, using web tables for training and new statistical measures to improve filtering relevance.
Contribution
It presents a novel classifier trained on web tables to determine attribute interestingness, incorporating new statistical measures for better data distribution capture.
Findings
The classifier effectively predicts interesting attributes based on user relevance assessments.
The proposed diversity measures outperform traditional entropy in identifying interesting categories.
A user study confirms the approach's practical applicability in data exploration.
Abstract
This work proposes and evaluates a novel approach to determine interesting categorical attributes for lists of entities. Once identified, such categories are of immense value to allow constraining (filtering) a current view of a user to subsets of entities. We show how a classifier is trained that is able to tell whether or not a categorical attribute can act as a constraint, in the sense of human-perceived interestingness. The training data is harnessed from Web tables, treating the presence or absence of a table as an indication that the attribute used as a filter constraint is reasonable or not. For learning the classification model, we review four well-known statistical measures (features) for categorical attributes: entropy, unalikeability, peculiarity, and coverage. We additionally propose three new statistical measures to capture the distribution of data, tailored to our main…
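To make two of the reviewed features concrete, the sketch below computes entropy and unalikeability for one categorical column. It uses the standard textbook definitions (Shannon entropy in bits, and unalikeability as one minus the sum of squared category proportions), which may differ in normalization from what the paper uses. The function names and the sample `genres` column are hypothetical. Peculiarity, coverage, and the three newly proposed measures are left out because the text above does not define them.

```python
from collections import Counter
from math import log2


def category_proportions(values):
    """Return the relative frequency of each distinct category in `values`."""
    if not values:
        raise ValueError("values must contain at least one entry")
    counts = Counter(values)
    total = sum(counts.values())
    return [count / total for count in counts.values()]


def entropy(values):
    """Shannon entropy in bits: higher when categories are evenly spread."""
    return sum(-p * log2(p) for p in category_proportions(values))


def unalikeability(values):
    """Probability that two entries drawn with replacement differ in category."""
    return 1.0 - sum(p * p for p in category_proportions(values))


# Hypothetical attribute column for a list of four movies.
genres = ["drama", "drama", "comedy", "comedy"]
print(entropy(genres), unalikeability(genres))
```

An attribute whose entries all share one value scores 0 on both measures, which matches the intuition that filtering on it cannot narrow the list of entities.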
Taxonomy
Topics: Data Quality and Management · Data Management and Algorithms · Time Series Analysis and Forecasting
