Classifier Data Quality: A Geometric Complexity Based Method for Automated Baseline And Insights Generation
George Kour, Marcel Zalmanovici, Orna Raz, Samuel Ackerman, Ateret Anaby-Tavor

TL;DR
This paper introduces geometric complexity measures that estimate how difficult individual observations are to classify. The measures support automatic baseline setting, highlight regions where misclassification is likely, and apply regardless of the model used.
Contribution
The paper proposes novel complexity measures that quantify observation difficulty and outperform simple baselines, providing explainable insights regardless of the classifier used.
Findings
Complexity measures identify data regions prone to misclassification.
Measures outperform simple baselines in accuracy and explainability.
Measures are effective on both synthetic data and real chatbot data.
Abstract
Testing Machine Learning (ML) models and AI-Infused Applications (AIIAs), or systems that contain ML models, is highly challenging. In addition to the challenges of testing classical software, it is acceptable and expected that statistical ML models sometimes output incorrect results. A major challenge is to determine when the level of incorrectness, e.g., model accuracy or F1 score for classifiers, is acceptable and when it is not. In addition to business requirements that should provide a threshold, it is a best practice to require any proposed ML solution to outperform simple baseline models, such as a decision tree. We have developed complexity measures, which quantify how difficult given observations are to assign to their true class label; these measures can then be used to automatically determine a baseline performance threshold. These measures are superior to the best…
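To make the idea of a per-observation difficulty score concrete, here is a minimal sketch. It is not the paper's measure: it swaps in a generic k-nearest-neighbour disagreement score (the fraction of an observation's nearest neighbours that carry a different label), and the baseline rule shown (share of observations with difficulty below 0.5) is likewise an assumption made only for illustration. The function names `knn_disagreement` and `baseline_accuracy` and the toy data are hypothetical.

```python
import math

def knn_disagreement(points, labels, k=3):
    """Illustrative difficulty score (an assumption, not the paper's measure):
    fraction of each observation's k nearest neighbours (Euclidean distance,
    excluding the observation itself) whose label differs from its own."""
    scores = []
    for i, p in enumerate(points):
        # Distances from observation i to every other observation.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        differing = sum(labels[j] != labels[i] for j in neighbours)
        scores.append(differing / len(neighbours))
    return scores

def baseline_accuracy(scores):
    """Hypothetical baseline threshold: share of observations whose
    neighbourhood mostly agrees with their label (difficulty below 0.5)."""
    return sum(s < 0.5 for s in scores) / len(scores)

# Toy data: two well-separated clusters, plus one point labelled "b"
# that sits in the middle of the "a" cluster (index 8).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6),
          (0.5, 0.5)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]

scores = knn_disagreement(points, labels, k=3)
threshold = baseline_accuracy(scores)

for point, label, score in zip(points, labels, scores):
    print(point, label, round(score, 2))
print("baseline accuracy threshold:", round(threshold, 2))
```

On this toy data the point at index 8 receives the highest difficulty (all of its neighbours belong to the other class), which is the kind of observation a difficulty score is meant to flag as prone to misclassification. A score of this form depends only on the data and labels, not on any trained classifier.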
Taxonomy
Topics: Statistical and Computational Modeling · Machine Learning and Data Classification · Data Mining Algorithms and Applications
