Quality of Data in Machine Learning
Antti Kariluoto, Arto Pärnänen, Joni Kultanen, Jukka Soininen, Pekka Abrahamsson

TL;DR
This paper challenges the common belief that more data always improves machine learning performance, showing that data quality is more crucial than quantity, based on empirical experiments with vocational student data.
Contribution
It provides empirical evidence that increasing data quantity does not necessarily enhance model accuracy, emphasizing the importance of data quality over quantity.
Findings
Increasing the number of data records does not significantly improve model accuracy.
For ensemble models, the variance of accuracy decreases as more data records are added.
Data quality is more important than data quantity for model performance.
Abstract
A common assumption holds that machine learning models improve their performance when they have more data to learn from. In this study, the authors wished to clarify the dilemma by performing an empirical experiment utilizing novel vocational student data. The experiment compared different machine learning algorithms while varying the number of data records and feature combinations available for training and testing the models. The experiment revealed that an increase in data records or their sample frequency does not immediately lead to significant increases in model accuracy or performance; however, the variance of accuracies does diminish in the case of ensemble models. A similar phenomenon was witnessed while increasing the number of input features for the models. The study refutes the starting assumption and continues to state that in this case the significance in data…
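The experimental design described in the abstract (vary the amount of training data, then compare the mean and variance of accuracy for a single model against an ensemble) can be illustrated with a minimal sketch. This is not the authors' code or data. It assumes synthetic two-class data, a nearest-centroid classifier as the single model, and a bagged ensemble of such classifiers. Every function name and parameter below is hypothetical and chosen only for illustration.

```python
import random
import statistics


def make_data(n, rng, noise=1.0):
    """Synthetic stand-in data: label y in {0, 1}, two features centred at -1 or +1."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        centre = 1.0 if y == 1 else -1.0
        data.append(((rng.gauss(centre, noise), rng.gauss(centre, noise)), y))
    return data


def fit_centroids(train):
    """Single model: nearest-centroid classifier (mean feature vector per class)."""
    centroids = {}
    for label in (0, 1):
        pts = [x for x, y in train if y == label]
        if pts:  # a class may be absent from a small bootstrap sample
            centroids[label] = tuple(sum(col) / len(pts) for col in zip(*pts))
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean distance)."""
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])))


def fit_bagged(train, rng, n_models=15):
    """Ensemble model: nearest-centroid classifiers fitted on bootstrap resamples."""
    return [fit_centroids([rng.choice(train) for _ in train]) for _ in range(n_models)]


def predict_bagged(models, x):
    """Majority vote over the ensemble (n_models is odd, so there are no ties)."""
    votes = sum(predict(m, x) for m in models)
    return 1 if votes * 2 > len(models) else 0


def accuracy(predict_fn, test):
    return sum(predict_fn(x) == y for x, y in test) / len(test)


def run_experiment(sizes=(20, 80, 320), repeats=20, seed=0):
    """For each training size, return (mean, variance) of accuracy per model type."""
    rng = random.Random(seed)
    test = make_data(300, rng)  # fixed held-out set shared by all runs
    results = {}
    for n in sizes:
        single, bagged = [], []
        for _ in range(repeats):
            train = make_data(n, rng)
            centroids = fit_centroids(train)
            single.append(accuracy(lambda x: predict(centroids, x), test))
            models = fit_bagged(train, rng)
            bagged.append(accuracy(lambda x: predict_bagged(models, x), test))
        results[n] = {
            "single": (statistics.mean(single), statistics.pvariance(single)),
            "bagged": (statistics.mean(bagged), statistics.pvariance(bagged)),
        }
    return results


if __name__ == "__main__":
    for n, row in run_experiment().items():
        for name, (mean_acc, var_acc) in row.items():
            print(f"n={n:4d} {name:6s} mean={mean_acc:.3f} var={var_acc:.5f}")
```

Running the script prints one line per training size and model type, so the mean and variance columns can be compared as the number of records grows. The outcome on this toy data says nothing about the vocational student data used in the study. The sketch only shows the shape of such a comparison.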
Taxonomy
Topics: Machine Learning and Data Classification
