Impact of Data Pruning on Machine Learning Algorithm Performance
Arun Thundyill Saseendran, Lovish Setia, Viren Chhabria, Debrup Chakraborty, Aneek Barman Roy

TL;DR
This study investigates how dataset pruning affects the performance of different machine learning algorithms in predicting movie ratings, revealing that the relative performance order remains consistent after pruning.
Contribution
It provides a comparative analysis of algorithm performance on pruned versus unpruned datasets using IMDb movie ratings, highlighting the impact of data pruning.
Findings
An algorithm that outperforms another on the unpruned dataset also outperforms it after pruning
Pruning does not significantly change algorithm rankings
Dataset pruning improves model efficiency without altering performance hierarchy
Abstract
Dataset pruning is the process of removing sub-optimal tuples from a dataset to improve the learning of a machine learning model. In this paper, we compared the performance of different algorithms, first on an unpruned dataset and then on an iteratively pruned dataset. The goal was to understand whether, if an algorithm (say A) performs better than another algorithm (say B) on an unpruned dataset, algorithm B would perform better on the pruned data, or vice versa. The dataset chosen for our analysis is a subset of the largest movie ratings database publicly available on the internet, IMDb [1]. The learning objective of the model was to predict the categorical rating of a movie among 5 bins: poor, average, good, very good, excellent. The results indicated that an algorithm that performed better on an unpruned dataset also performed better on a pruned dataset.
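The two ideas in the abstract, mapping a rating to one of five categories and checking whether the ordering of algorithms survives pruning, can be illustrated with a minimal Python sketch. The bin edges (equal-width cut-offs on a 0-10 scale) and the helper names `rating_to_bin`, `ranking`, and `same_ranking` are assumptions made for illustration; the abstract does not state the paper's actual cut-offs, pruning criterion, or evaluation metric.

```python
from bisect import bisect_right

# ASSUMPTION: equal-width cut-offs on a 0-10 rating scale.
# The actual bin boundaries used in the paper are not given in the abstract.
ASSUMED_EDGES = [2.0, 4.0, 6.0, 8.0]
LABELS = ["poor", "average", "good", "very good", "excellent"]


def rating_to_bin(rating, edges=ASSUMED_EDGES):
    """Map a numeric rating to one of the five categorical labels."""
    if not 0.0 <= rating <= 10.0:
        raise ValueError(f"rating out of range: {rating}")
    # bisect_right counts how many edges are <= rating, which is the bin index.
    return LABELS[bisect_right(edges, rating)]


def ranking(scores):
    """Return algorithm names ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


def same_ranking(unpruned_scores, pruned_scores):
    """True when both score tables order the algorithms identically."""
    return ranking(unpruned_scores) == ranking(pruned_scores)
```

Here `same_ranking` takes two dictionaries mapping a (hypothetical) algorithm name to its score, one measured on the unpruned data and one on the pruned data, and reports whether the relative order is unchanged, which is the property the paper's results describe.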
Taxonomy
Topics: Machine Learning and Data Classification · Anomaly Detection Techniques and Applications · Imbalanced Data Classification Techniques
Methods: Pruning
