Assessing Dataset Quality Through Decision Tree Characteristics in Autoencoder-Processed Spaces
Szymon Mazurek, Maciej Wielgosz

TL;DR
This paper presents a framework for assessing dataset quality in classification tasks by analyzing decision tree characteristics in autoencoder-processed feature spaces, emphasizing the importance of data quality, feature selection, and data volume.
Contribution
It introduces a novel framework for dataset quality assessment using decision tree analysis in autoencoder spaces, validated across diverse datasets with varying complexity.
Findings
Dataset quality significantly affects model performance.
Feature selection and data volume are critical for high accuracy.
Autoencoder-based analysis helps identify data issues.
Abstract
In this paper, we examine dataset quality assessment in machine learning classification tasks. Using nine distinct datasets, each designed for classification tasks of varying complexity, we illustrate the impact of dataset quality on model training and performance. We further introduce two additional datasets designed to represent specific data conditions: one maximizing entropy and the other demonstrating high redundancy. Our findings underscore the importance of appropriate feature selection, adequate data volume, and data quality in achieving high-performing machine learning models. To aid researchers and practitioners, we propose a comprehensive framework for dataset quality assessment, which can help evaluate whether the dataset at hand is sufficient and of the required quality for a specific task. This research offers…
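The abstract refers to two reference datasets, one maximizing entropy and one with high redundancy, but this summary does not say how they were built. The sketch below is only an illustration of what those two conditions could look like, not the authors' construction: the generators `make_max_entropy` and `make_redundant` and the two diagnostics `label_entropy` and `duplicate_ratio` are hypothetical names introduced here, and the choice of uniform random features is an assumption.

```python
import math
import random
from collections import Counter


def label_entropy(labels):
    """Shannon entropy, in bits, of the empirical label distribution."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def duplicate_ratio(rows):
    """Fraction of rows that are exact repeats of another row."""
    return 1 - len(set(rows)) / len(rows)


def make_max_entropy(n, n_features, n_classes, rng):
    """Assumed 'maximum entropy' condition: uniform random features and
    perfectly balanced labels that carry no relation to the features."""
    rows = [tuple(rng.random() for _ in range(n_features)) for _ in range(n)]
    labels = [i % n_classes for i in range(n)]
    rng.shuffle(labels)
    return rows, labels


def make_redundant(n, n_features, n_unique, rng):
    """Assumed 'high redundancy' condition: a few prototype rows repeated
    many times, with the label fixed by the prototype index."""
    prototypes = [
        tuple(rng.random() for _ in range(n_features)) for _ in range(n_unique)
    ]
    rows = [prototypes[i % n_unique] for i in range(n)]
    labels = [(i % n_unique) % 2 for i in range(n)]
    return rows, labels


if __name__ == "__main__":
    rng = random.Random(0)
    for name, (rows, labels) in {
        "max_entropy": make_max_entropy(1000, 8, 4, rng),
        "redundant": make_redundant(1000, 8, 10, rng),
    }.items():
        print(name, label_entropy(labels), duplicate_ratio(rows))
```

Under these assumptions, the first dataset has the highest possible label entropy for its class count and almost no repeated rows, while the second consists almost entirely of repeats. Both are simple checks that can be run before any model is trained.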
Taxonomy
Topics: Machine Learning and Data Classification · Anomaly Detection Techniques and Applications · Data Stream Mining Techniques
