Data Aggregation for Hierarchical Clustering

Erich Schubert; Andreas Lang

arXiv:2309.02552 · stat.ML · September 7, 2023 · Mach. Learn. under Resour. Constraints Vol. 1

TL;DR

This paper presents a data aggregation method based on BETULA, a numerically stable version of BIRCH, that makes hierarchical clustering feasible on resource-constrained systems with only small losses in clustering quality.

Contribution

The paper combines BETULA data aggregation with HAC to reduce the memory and runtime requirements of hierarchical clustering on large data sets.

Findings

01. Enables HAC on resource-limited systems

02. Maintains high clustering quality with small data aggregation

03. Reduces memory and computational costs significantly
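
To put the memory finding in perspective, here is a minimal sketch that estimates the size of a condensed pairwise distance matrix. The function name, the 8-byte entry size, and the example sizes of 100,000 points and 1,000 aggregates are illustrative assumptions, not figures from the paper.

```python
def distance_matrix_bytes(n: int, bytes_per_entry: int = 8) -> int:
    """Bytes needed to store the n*(n-1)/2 pairwise distances of n objects."""
    return n * (n - 1) // 2 * bytes_per_entry


raw = distance_matrix_bytes(100_000)       # hypothetical raw data set size
aggregated = distance_matrix_bytes(1_000)  # hypothetical number of aggregates

# prints: raw: 40.0 GB, aggregated: 4.0 MB
print(f"raw: {raw / 1e9:.1f} GB, aggregated: {aggregated / 1e6:.1f} MB")
```

Because the matrix grows quadratically, shrinking the input by a factor of 100 shrinks the matrix by a factor of roughly 10,000.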

Abstract

Hierarchical Agglomerative Clustering (HAC) is likely the earliest and most flexible clustering method, because it can be used with many distances, similarities, and various linkage strategies. It is often used when the number of clusters the data set forms is unknown and some sort of hierarchy in the data is plausible. Most algorithms for HAC operate on a full distance matrix, and therefore require quadratic memory. The standard algorithm also has cubic runtime to produce a full hierarchy. Both memory and runtime are especially problematic in the context of embedded or otherwise very resource-constrained systems. In this section, we present how data aggregation with BETULA, a numerically stable version of the well-known BIRCH data aggregation algorithm, can be used to make HAC viable on systems with constrained resources with only small losses in clustering quality, and hence allow…

Code & Models

Repositories

elki-project/elki (official)
