Towards Modeling Data Quality and Machine Learning Model Performance
Usman Anjum, Chris Trentman, Elrod Caden, Justin Zhan

TL;DR
This paper introduces a new metric, the deterministic-non-deterministic ratio (DDR), to quantify uncertainty and noise in data, and uses synthetic-data experiments to show how it helps explain and predict machine learning model performance.
Contribution
The paper proposes the DDR metric based on SNR to model the impact of data noise on machine learning accuracy, a novel approach for performance analysis.
Findings
DDR correlates with accuracy changes in synthetic data
DDR-accuracy curves can predict model performance
The approach enhances understanding of data quality effects
Abstract
Understanding the effect of uncertainty and noise in data on machine learning models (MLMs) is crucial for building trust and measuring performance. In this paper, a new model is proposed to quantify the effect of uncertainty and noise in data on MLMs. Building on the concept of signal-to-noise ratio (SNR), a new metric called the deterministic-non-deterministic ratio (DDR) is proposed to formulate the performance of a model. Using synthetic data in experiments, we show how accuracy changes with DDR and how DDR-accuracy curves can be used to determine the performance of a model.
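The abstract does not give the formal definition of DDR, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a hypothetical SNR-style ratio (the number of samples whose label follows a deterministic rule divided by the number whose label is random) and a toy 1-D threshold classifier. All function names, the labeling rule, and the noise model are illustrative assumptions.

```python
import random


def make_dataset(n, noisy_fraction, rng):
    """Synthetic 1-D data. Assumed setup: label = 1 if x > 0.5 (deterministic
    part); with probability `noisy_fraction` the label is a coin flip instead
    (non-deterministic part)."""
    xs, ys, n_nondet = [], [], 0
    for _ in range(n):
        x = rng.random()
        if rng.random() < noisy_fraction:
            y = rng.randint(0, 1)  # label carries no information about x
            n_nondet += 1
        else:
            y = int(x > 0.5)       # label fully determined by x
        xs.append(x)
        ys.append(y)
    return xs, ys, n - n_nondet, n_nondet


def ratio_like(n_det, n_nondet):
    """Hypothetical SNR-style ratio; NOT the paper's DDR definition."""
    return float("inf") if n_nondet == 0 else n_det / n_nondet


def fit_threshold(xs, ys):
    """Nearest-centroid rule in 1-D: midpoint between the two class means."""
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    if not zeros or not ones:
        return 0.5  # fallback when one class is missing
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def accuracy(threshold, xs, ys):
    """Fraction of samples where the rule x > threshold matches the label."""
    hits = sum(int(x > threshold) == y for x, y in zip(xs, ys))
    return hits / len(xs)


def sweep(noisy_fractions, n=2000, seed=0):
    """Return (ratio, test accuracy) pairs, one per noise level."""
    rng = random.Random(seed)
    curve = []
    for p in noisy_fractions:
        train_x, train_y, n_det, n_nondet = make_dataset(n, p, rng)
        test_x, test_y, _, _ = make_dataset(n, p, rng)
        t = fit_threshold(train_x, train_y)
        curve.append((ratio_like(n_det, n_nondet), accuracy(t, test_x, test_y)))
    return curve


if __name__ == "__main__":
    for ratio, acc in sweep([0.0, 0.2, 0.5, 0.8]):
        print(f"ratio={ratio:.2f}  accuracy={acc:.3f}")
```

Plotting the returned pairs gives a ratio-versus-accuracy curve for this toy setup, in the spirit of the DDR-accuracy curves the abstract describes. In this sketch accuracy falls as the ratio shrinks, because a growing share of labels cannot be predicted from the input.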
Taxonomy
Topics: Data Quality and Management · Advanced Database Systems and Queries · Data Mining Algorithms and Applications
