Cutting the Unnecessary Long Tail: Cost-Effective Big Data Clustering in the Cloud
Dongwei Li, Shuliang Wang, Nan Gao, Qiang He, Yun Yang

TL;DR
This paper introduces a cost-effective method for clustering big data in the cloud: a regression model trained on sampled data stops the clustering algorithm early, significantly reducing computation cost while maintaining high accuracy.
Contribution
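The paper proposes a novel early-stopping approach for the k-means and EM clustering algorithms. A regression model trained on sampled data decides when to stop, so that a sufficiently satisfactory accuracy is reached at the lowest possible computation cost in the cloud.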
Findings
Achieves high cost savings in cloud clustering tasks.
Early stopping maintains 99% accuracy with substantially less computation.
Demonstrates practical savings, e.g., up to $94,687.49 in land use classification.
Abstract
Clustering big data often requires tremendous computational resources where cloud computing is undoubtedly one of the promising solutions. However, the computation cost in the cloud can be unexpectedly high if it cannot be managed properly. The long tail phenomenon has been observed widely in the big data clustering area, which indicates that the majority of time is often consumed in the middle to late stages in the clustering process. In this research, we try to cut the unnecessary long tail in the clustering process to achieve a sufficiently satisfactory accuracy at the lowest possible computation cost. A novel approach is proposed to achieve cost-effective big data clustering in the cloud. By training the regression model with the sampling data, we can make widely used k-means and EM (Expectation-Maximization) algorithms stop automatically at an early point when the desired accuracy…
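The early-stopping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices in it are assumptions made only for illustration: "accuracy" is taken to be the fraction of points whose label already matches the converged labeling, the regression is a one-variable least-squares line on the per-iteration label change rate, and all function names (`train_stop_model`, `kmeans_early_stop`, and so on) are hypothetical. The sketch covers k-means only.

```python
import math
import random

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def update(points, labels, centers):
    """Move each center to the mean of its points; empty clusters stay put."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centers.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers

def change_rate(a, b):
    """Fraction of points whose label differs between two labelings."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - slope * mx, slope

def train_stop_model(sample, centers, max_iter=100):
    """Run k-means to convergence on the sample, then regress accuracy on change rate.

    Assumption: accuracy at iteration t is the agreement with the converged labels.
    """
    history = [assign(sample, centers)]
    for _ in range(max_iter):
        centers = update(sample, history[-1], centers)
        history.append(assign(sample, centers))
        if history[-1] == history[-2]:
            break
    final = history[-1]
    xs = [change_rate(history[t], history[t - 1]) for t in range(1, len(history))]
    ys = [1.0 - change_rate(history[t], final) for t in range(1, len(history))]
    return fit_line(xs, ys)

def kmeans_early_stop(points, centers, model, target=0.99, max_iter=100):
    """k-means that stops once the model predicts the target accuracy is reached."""
    intercept, slope = model
    labels = assign(points, centers)
    for iteration in range(1, max_iter + 1):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        rate = change_rate(new_labels, labels)
        labels = new_labels
        # Stop on true convergence, or as soon as predicted accuracy hits the target.
        if rate == 0.0 or intercept + slope * rate >= target:
            break
    return labels, iteration

# Illustrative usage on synthetic data (three Gaussian blobs; not from the paper).
random.seed(0)
blobs = [(random.gauss(cx, 1.0), random.gauss(cy, 1.0))
         for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(200)]
init = blobs[:3]  # hypothetical initial centers
model = train_stop_model(random.sample(blobs, 100), init)
labels, used = kmeans_early_stop(blobs, init, model)
```

Passing the model `(0.0, 0.0)` predicts zero accuracy at every step, which disables early stopping and gives a run-to-convergence baseline for comparing iteration counts. In a cloud setting, iterations saved translate into instance time saved, which is the cost the abstract aims to cut.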
Taxonomy
Topics: Data Stream Mining Techniques · Advanced Clustering Algorithms Research · Data Mining Algorithms and Applications
