PyHard: a novel tool for generating hardness embeddings to support data-centric analysis
Pedro Yuri Arbs Paiva, Kate Smith-Miles, Maria Gabriela Valeriano, and Ana Carolina Lorena

TL;DR
PyHard is a new tool that uses Instance Space Analysis to create a visual, interpretable embedding of dataset hardness, helping to evaluate data quality and model performance in machine learning.
Contribution
It introduces a novel hardness embedding technique that visually reveals data and model strengths and weaknesses, aiding data-centric ML analysis.
Findings
Identified pockets of hard-to-classify observations in a COVID prognosis dataset.
Visualized model strengths and weaknesses across dataset regions.
Supported targeted data inspection and model improvement.
Abstract
For building successful Machine Learning (ML) systems, it is imperative to have high-quality data and well-tuned learning models. But how can one assess the quality of a given dataset? And how can the strengths and weaknesses of a model on a dataset be revealed? Our new tool PyHard employs a methodology known as Instance Space Analysis (ISA) to produce a hardness embedding of a dataset relating the predictive performance of multiple ML models to estimated instance hardness meta-features. This space is built so that observations are distributed linearly regarding how hard they are to classify. The user can visually interact with this embedding in multiple ways and obtain useful insights about data and algorithmic performance across the individual observations of the dataset. We show in a COVID prognosis dataset how this analysis supported the identification of pockets of hard observations…
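To make the idea of an "instance hardness meta-feature" concrete, the sketch below computes k-Disagreeing Neighbors (kDN), a measure from the instance hardness literature: the fraction of an observation's k nearest neighbors that carry a different class label. This is only an illustration of the kind of per-observation quantity the abstract refers to. It is not PyHard's API or implementation, and the function name `k_disagreeing_neighbors` and the toy data are hypothetical.

```python
from math import dist


def k_disagreeing_neighbors(points, labels, k=3):
    """Return, for each observation, the fraction of its k nearest
    neighbors (Euclidean distance) that carry a different label.

    Hypothetical helper for illustration; not part of PyHard.
    """
    if not 0 < k < len(points):
        raise ValueError("k must satisfy 0 < k < number of observations")
    scores = []
    for i, (p, y) in enumerate(zip(points, labels)):
        # Distances from observation i to every other observation.
        others = [(dist(p, q), labels[j])
                  for j, q in enumerate(points) if j != i]
        others.sort(key=lambda pair: pair[0])
        # Count how many of the k closest neighbors disagree with label y.
        disagreeing = sum(1 for _, lab in others[:k] if lab != y)
        scores.append(disagreeing / k)
    return scores


# Toy data (made up for illustration): two well-separated clusters, plus
# one class-1 point sitting in the middle of the class-0 cluster.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
          (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]

hardness = k_disagreeing_neighbors(points, labels, k=3)
# Index of the observation with the highest kDN score.
hardest = max(range(len(points)), key=hardness.__getitem__)
```

On this toy data the class-1 point at (0.5, 0.5) is surrounded only by class-0 neighbors, so it receives the highest score, while points deep inside the second cluster score zero. A hardness embedding in the sense of the abstract would combine several such meta-features with the measured performance of multiple models, which this sketch does not attempt.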
Taxonomy
Topics: Machine Learning and Data Classification · Anomaly Detection Techniques and Applications · Machine Learning and Algorithms
