# Hybridized Threshold Clustering for Massive Data

**Authors:** Jianmei Luo, ChandraVyas Annakula, Aruna Sai Kannamareddy, Jasjeet S. Sekhon, William Henry Hsu, Michael Higgins

arXiv: 1907.02907 · 2019-07-08

## TL;DR

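This paper introduces iterative hybridized threshold clustering (IHTC), a hybrid approach that cuts the computational cost of clustering massive datasets. It repeatedly applies threshold clustering to shrink the data to a small set of prototype points, then runs a traditional algorithm such as $k$-means or HAC on those prototypes, while preserving that algorithm's clustering performance.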

## Contribution

The paper proposes a novel iterative hybridized threshold clustering method that significantly improves efficiency for large-scale data clustering while preserving accuracy.

## Key findings

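- Combined with $k$-means or HAC, IHTC substantially reduces the run time and memory usage of the original algorithm while preserving its performance.
- IHTC helps prevent clustering algorithms from overfitting singular data points.
- Simulations and applications to several real datasets support these claims.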
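
The run time and memory savings come from how quickly the prototype set shrinks. A rough bound makes this concrete, under an assumption not stated in this summary: that every cluster produced by threshold clustering contains at least $t^*$ points, as in the usual formulation of TC. The symbols $t^*$ (minimum cluster size), $m$ (number of TC iterations), and $n_m$ (number of prototypes after $m$ iterations) are notation introduced here for illustration.

$$
n_1 \le \frac{n}{t^*}, \qquad n_m \le \frac{n_{m-1}}{t^*} \quad\Longrightarrow\quad n_m \le \frac{n}{(t^*)^m}.
$$

For example, with $n = 10^6$, $t^* = 2$, and $m = 10$, the bound gives $n_m \le 10^6 / 1024 \approx 976.6$, so the final $k$-means or HAC step would run on at most 976 prototypes rather than one million points.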

## Abstract

As the size $n$ of datasets becomes massive, many commonly used clustering algorithms (for example, $k$-means or hierarchical agglomerative clustering (HAC)) require prohibitive computational cost and memory. In this paper, we propose a solution to these clustering problems by extending threshold clustering (TC) to problems of instance selection. TC is a recently developed clustering algorithm designed to partition data into many small clusters in linearithmic time (on average). Our proposed clustering method is as follows. First, TC is performed and clusters are reduced into single "prototype" points. Then, TC is applied repeatedly on these prototype points until sufficient data reduction has been obtained. Finally, a more sophisticated clustering algorithm is applied to the reduced prototype points, thereby obtaining a clustering on all $n$ data points. This entire procedure for clustering is called iterative hybridized threshold clustering (IHTC). Through simulation results and by applying our methodology to several real datasets, we show that IHTC combined with $k$-means or HAC substantially reduces the run time and memory usage of the original clustering algorithms while still preserving their performance. Additionally, IHTC helps prevent singular data points from being overfit by clustering algorithms.
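
The three steps in the abstract (reduce to prototypes, repeat, cluster the prototypes and map labels back) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. In particular, `greedy_threshold_groups` is a hypothetical stand-in for TC: it only guarantees a minimum group size and runs in quadratic time, whereas the real TC algorithm is linearithmic on average. Using unweighted centroids as prototypes, the function names, and the parameters `t` and `iterations` are likewise assumptions made for the example.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def greedy_threshold_groups(points, t):
    """Hypothetical stand-in for TC (NOT the real algorithm).

    Greedily groups each seed with its t - 1 nearest unassigned points, so
    every group has at least t members whenever len(points) >= t.
    """
    unassigned = set(range(len(points)))
    groups = []
    # Stop early enough that the leftover points still form a valid group.
    while len(unassigned) >= 2 * t:
        seed = min(unassigned)
        unassigned.remove(seed)
        nearest = sorted(
            unassigned, key=lambda j: (math.dist(points[seed], points[j]), j)
        )[: t - 1]
        unassigned.difference_update(nearest)
        groups.append([seed] + nearest)
    if unassigned:
        groups.append(sorted(unassigned))
    return groups


def kmeans(points, k, iters=20):
    """Plain Lloyd iterations with deterministic farthest-first seeding."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    labels = []
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
    return labels


def ihtc(points, t, iterations, final_cluster):
    """Sketch of the IHTC pipeline described in the abstract."""
    prototypes = list(points)
    # membership[i] = index of the prototype currently representing point i.
    membership = list(range(len(points)))
    for _ in range(iterations):
        groups = greedy_threshold_groups(prototypes, t)
        group_of = {i: g for g, members in enumerate(groups) for i in members}
        # Assumption: each group is reduced to its unweighted centroid.
        prototypes = [centroid([prototypes[i] for i in g]) for g in groups]
        membership = [group_of[m] for m in membership]
    # Cluster only the reduced prototypes, then propagate labels back.
    proto_labels = final_cluster(prototypes)
    return [proto_labels[m] for m in membership]


if __name__ == "__main__":
    blob_a = [(i * 0.1, j * 0.1) for i in range(8) for j in range(5)]
    blob_b = [(100 + x, 100 + y) for x, y in blob_a]
    labels = ihtc(blob_a + blob_b, t=2, iterations=2,
                  final_cluster=lambda pts: kmeans(pts, 2))
    print(labels[:3], labels[-3:])
```

In this usage example, two reduction rounds with `t=2` shrink 80 points to 20 prototypes before the final step runs. Passing the final algorithm as the `final_cluster` callable mirrors the "hybridized" design: the reduction stage does not depend on which conventional algorithm is applied afterwards.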

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1907.02907/full.md

## Figures

39 figures with captions in the complete paper: https://tomesphere.com/paper/1907.02907/full.md

## References

42 references — full list in the complete paper: https://tomesphere.com/paper/1907.02907/full.md

---
Source: https://tomesphere.com/paper/1907.02907