The Utility of Clustering in Prediction Tasks
Shubhendu Trivedi, Zachary A. Pardos, Neil T. Heffernan

TL;DR
This paper investigates how clustering can be used directly to improve prediction accuracy: predictors are trained on k-means clusters at several scales and their predictions are combined, which reduced error on most of the datasets examined.
Contribution
It provides a detailed analysis of using clustering as a preprocessing step to improve prediction accuracy and demonstrates its effectiveness across multiple datasets and models.
Findings
Clustering improves prediction accuracy in most datasets.
Combining clustered predictions outperforms individual predictors.
The method improves even on Random Forest predictions.
Abstract
We explore the utility of clustering in reducing error in various prediction tasks. Previous work has hinted at the improvement in prediction accuracy attributed to clustering algorithms if used to pre-process the data. In this work we more deeply investigate the direct utility of using clustering to improve prediction accuracy and provide explanations for why this may be so. We look at a number of datasets, run k-means at different scales and for each scale we train predictors. This produces k sets of predictions. These predictions are then combined by a naïve ensemble. We observed that this use of a predictor in conjunction with clustering improved the prediction accuracy in most datasets. We believe this indicates the predictive utility of exploiting structure in the data and the data compression handed over by clustering. We also found that using this method improves upon the…
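The procedure the abstract describes (k-means at several scales, one predictor per cluster, predictions averaged across scales) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-cluster predictor here is just the mean target of the cluster, standing in for whatever model the paper trains, and the names `kmeans`, `fit_scale`, `fit_ensemble`, and `predict` are hypothetical helpers written for this example.

```python
import random
from statistics import mean


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, x):
    """Index of the centroid closest to point x."""
    return min(range(len(centroids)), key=lambda j: sq_dist(centroids[j], x))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; initial centroids are k random data points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centroids, p)].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [tuple(mean(dim) for dim in zip(*g)) if g else c
                   for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def fit_scale(X, y, k, seed=0):
    """Cluster X into k groups and fit one predictor per cluster.

    Assumption: the predictor is the cluster's mean target, a stand-in
    for any regressor trained on that cluster's rows.
    """
    centroids = kmeans(X, k, seed=seed)
    targets = [[] for _ in range(k)]
    for x, t in zip(X, y):
        targets[nearest(centroids, x)].append(t)
    fallback = mean(y)  # used only if a cluster ends up empty
    models = [mean(ts) if ts else fallback for ts in targets]
    return centroids, models


def fit_ensemble(X, y, max_k, seed=0):
    """One clustered model per scale k = 1 .. max_k."""
    return [fit_scale(X, y, k, seed) for k in range(1, max_k + 1)]


def predict(ensemble, x):
    """Naive ensemble: unweighted average of the per-scale predictions."""
    return mean(models[nearest(centroids, x)] for centroids, models in ensemble)
```

Calling `fit_ensemble(X, y, max_k=K)` on training features `X` (a list of tuples) and targets `y`, then `predict(ensemble, x)` on a new point, gives the averaged prediction. At k = 1 the model reduces to a single global predictor, so the ensemble always includes the unclustered baseline as one of its members.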
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Algorithms
