Classifier Data Quality: A Geometric Complexity Based Method for Automated Baseline And Insights Generation

George Kour; Marcel Zalmanovici; Orna Raz; Samuel Ackerman; Ateret Anaby-Tavor

arXiv:2112.11832 · cs.LG · October 28, 2022 · 1 citation

TL;DR

This paper introduces geometric complexity measures to evaluate data difficulty in classification tasks, enabling automatic baseline setting and insights into potential misclassification regions, applicable across different models.

Contribution

The paper proposes novel complexity measures that quantify observation difficulty and outperform simple baselines, providing explainable insights regardless of the classifier used.

Findings

01. Complexity measures identify data regions prone to misclassification.

02. Measures outperform simple baselines in accuracy and explainability.

03. Effective on both synthetic and real chatbot data.

Abstract

Testing Machine Learning (ML) models and AI-Infused Applications (AIIAs), or systems that contain ML models, is highly challenging. In addition to the challenges of testing classical software, it is acceptable and expected that statistical ML models sometimes output incorrect results. A major challenge is to determine when the level of incorrectness, e.g., model accuracy or F1 score for classifiers, is acceptable and when it is not. In addition to business requirements that should provide a threshold, it is a best practice to require any proposed ML solution to outperform simple baseline models, such as a decision tree. We have developed complexity measures, which quantify how difficult given observations are to assign to their true class label; these measures can then be used to automatically determine a baseline performance threshold. These measures are superior to the best…
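To make the idea of a per-observation difficulty score concrete, here is a minimal Python sketch. It does not reproduce the paper's measures, which are not specified in the text above. As a stand-in it uses a well-known neighbourhood heuristic: the fraction of an observation's k nearest neighbours that carry a different label. The function names `knn_disagreement` and `illustrative_baseline`, the choice `k=3`, the 0.5 cutoff, and the toy data are all assumptions made for illustration.

```python
import math


def knn_disagreement(points, labels, k=3):
    """Return, for each observation, the fraction of its k nearest
    neighbours (Euclidean distance) whose label differs from its own.

    This is a stand-in difficulty score, not the measure from the paper.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances to every other observation (the point itself is excluded).
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        differing = sum(labels[j] != labels[i] for j in neighbours)
        scores.append(differing / k)
    return scores


def illustrative_baseline(scores, cutoff=0.5):
    """Hypothetical baseline: share of observations whose neighbours
    mostly agree with their label (score below the assumed cutoff)."""
    return sum(s < cutoff for s in scores) / len(scores)


# Toy data: two tight clusters, with one "b" observation sitting inside
# the "a" cluster (index 3), which should be flagged as hard.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1),
]
labels = ["a", "a", "a", "b", "b", "b", "b", "b"]

scores = knn_disagreement(points, labels, k=3)
hard = [i for i, s in enumerate(scores) if s >= 0.5]
baseline = illustrative_baseline(scores)

print("difficulty scores:", [round(s, 2) for s in scores])
print("hard observations:", hard)
print("illustrative baseline:", baseline)
```

In this sketch the high-scoring observations mark a region where any classifier is likely to err, and the share of low-scoring observations gives a rough, model-independent reference level that a proposed classifier could be compared against. How the paper actually derives its threshold from its measures may differ.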

Taxonomy

Topics: Statistical and Computational Modeling · Machine Learning and Data Classification · Data Mining Algorithms and Applications