K-tree: Large Scale Document Clustering

Christopher M. De Vries; Shlomo Geva

arXiv:1001.0830·cs.IR·January 7, 2010

TL;DR

K-tree is a hierarchical clustering algorithm designed for large-scale document collections, offering efficient, scalable clustering with low time complexity and suitability for disk-based implementations.

Contribution

The paper introduces K-tree, a hierarchical clustering method that approximates k-means, extends it to handle sparse document representations, and evaluates its efficiency and scalability on large document collections.

Findings

01 K-tree outperforms CLUTO in clustering quality and speed.

02 It is suitable for large, disk-resident document collections.

03 K-tree maintains low time complexity for large datasets.

Abstract

We introduce K-tree in an information retrieval context. It is an efficient approximation of the k-means clustering algorithm. Unlike k-means, it forms a hierarchy of clusters. It has been extended to address issues with sparse representations. We compare its performance and quality to CLUTO using document collections. K-tree has a low time complexity that is suitable for large document collections. Its tree structure allows for efficient disk-based implementations where space requirements exceed the capacity of main memory.
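
The abstract describes K-tree only at a high level: a tree of clusters that approximates k-means. The sketch below illustrates one way such a structure can work, assuming a height-balanced tree whose nodes hold at most `order` entries and are split with 2-means when they overflow. It is not the authors' implementation. The names `KTree`, `Node`, `two_means`, and `order` are hypothetical, and the code uses dense Python lists rather than the sparse representations the paper addresses.

```python
def dist2(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def wmean(points, weights):
    """Weighted mean of a non-empty list of vectors."""
    total = sum(weights)
    return [sum(w * x for w, x in zip(weights, col)) / total for col in zip(*points)]

def two_means(points, weights, iters=10):
    """Weighted 2-means; returns one 0/1 label per point, with both labels used."""
    centers = [points[0], max(points, key=lambda p: dist2(p, points[0]))]
    labels = []
    for _ in range(iters):
        labels = [0 if dist2(p, centers[0]) <= dist2(p, centers[1]) else 1 for p in points]
        if len(set(labels)) < 2:
            # Degenerate input (e.g. identical points): fall back to an even split.
            half = len(points) // 2
            return [0] * half + [1] * (len(points) - half)
        for lab in (0, 1):
            idx = [i for i, l in enumerate(labels) if l == lab]
            centers[lab] = wmean([points[i] for i in idx], [weights[i] for i in idx])
    return labels

class Node:
    def __init__(self, leaf, items):
        self.leaf = leaf    # True: items are vectors; False: items are child Nodes
        self.items = items
        self.refresh()

    def keys(self):
        return self.items if self.leaf else [c.center for c in self.items]

    def weights(self):
        return [1] * len(self.items) if self.leaf else [c.count for c in self.items]

    def refresh(self):
        """Recompute how many vectors lie below this node, and their mean."""
        self.count = sum(self.weights())
        self.center = wmean(self.keys(), self.weights()) if self.items else None

class KTree:
    def __init__(self, order):
        self.order = order              # assumed maximum entries per node (>= 2)
        self.root = Node(True, [])

    def insert(self, vec):
        # Descend to a leaf, following the nearest cluster center at each level.
        path = [self.root]
        while not path[-1].leaf:
            path.append(min(path[-1].items, key=lambda c: dist2(c.center, vec)))
        path[-1].items.append(list(vec))
        # Walk back up: split overflowing nodes, refresh the means of the rest.
        for depth in range(len(path) - 1, -1, -1):
            node = path[depth]
            if len(node.items) <= self.order:
                node.refresh()
                continue
            left, right = self._split(node)
            if depth == 0:
                self.root = Node(False, [left, right])   # tree grows at the root
            else:
                parent = path[depth - 1]
                i = parent.items.index(node)
                parent.items[i:i + 1] = [left, right]

    def _split(self, node):
        labels = two_means(node.keys(), node.weights())
        groups = ([], [])
        for item, lab in zip(node.items, labels):
            groups[lab].append(item)
        return Node(node.leaf, groups[0]), Node(node.leaf, groups[1])

    def nearest_leaf(self, vec):
        """Greedy top-down search for the leaf cluster closest to vec."""
        node = self.root
        while not node.leaf:
            node = min(node.items, key=lambda c: dist2(c.center, vec))
        return node

    def leaves(self):
        """Return (depth, leaf) pairs for every leaf in the tree."""
        out, stack = [], [(0, self.root)]
        while stack:
            depth, node = stack.pop()
            if node.leaf:
                out.append((depth, node))
            else:
                stack.extend((depth + 1, c) for c in node.items)
        return out
```

In this sketch each insertion touches only one root-to-leaf path, which is why a tree of this kind can be cheaper than re-running flat k-means over the whole collection. Because every node is a fixed-size block of entries, the same layout could in principle be paged to disk, in line with the disk-based use the abstract mentions.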
