K-tree: Large Scale Document Clustering
Christopher M. De Vries, Shlomo Geva

TL;DR
K-tree is a hierarchical clustering algorithm for large document collections. It approximates k-means with low time complexity, and its tree structure suits disk-based implementations.
Contribution
The paper introduces K-tree in an information retrieval context: a hierarchical approximation of k-means, extended to handle sparse document representations, and compared with CLUTO on document collections.
Findings
K-tree is compared with CLUTO on both run time and clustering quality using document collections.
It is suitable for large, disk-resident document collections.
K-tree maintains low time complexity for large datasets.
Abstract
We introduce K-tree in an information retrieval context. It is an efficient approximation of the k-means clustering algorithm. Unlike k-means, it forms a hierarchy of clusters. It has been extended to address issues with sparse representations. We compare performance and quality to CLUTO using document collections. The K-tree has a low time complexity that is suitable for large document collections. Its tree structure allows for efficient disk-based implementations where space requirements exceed main memory.
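The abstract does not spell out how the tree is built, so the following is only a minimal sketch of one way a K-tree-style structure can work, under stated assumptions: vectors are inserted one at a time by descending to the nearest centroid, a node that exceeds its capacity is split with 2-means, and the resulting centroids are promoted to the parent, much as a B+-tree grows from the root. The names `KTree`, `Node`, `split`, `leaf_depths`, and the `order` parameter are hypothetical, vectors are dense Python lists, and neither the sparse-representation extension nor disk storage is modelled.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class Node:
    def __init__(self, entries):
        # Each entry is [vector, weight, child]; child is None inside a leaf.
        self.entries = entries

    def summary(self):
        """Weighted centroid and total weight of everything below this node."""
        total = sum(w for _, w, _ in self.entries)
        dim = len(self.entries[0][0])
        centroid = [sum(v[d] * w for v, w, _ in self.entries) / total
                    for d in range(dim)]
        return centroid, total

def split(node, rng, iters=10):
    """Split an over-full node into two nodes using 2-means on its entries."""
    entries = node.entries
    a, b = rng.sample(range(len(entries)), 2)
    centers = [entries[a][0], entries[b][0]]
    for _ in range(iters):
        groups = [[], []]
        for e in entries:
            nearer_second = dist2(e[0], centers[1]) < dist2(e[0], centers[0])
            groups[1 if nearer_second else 0].append(e)
        if not groups[0] or not groups[1]:
            break
        centers = [Node(g).summary()[0] for g in groups]
    if not groups[0] or not groups[1]:
        # Degenerate case (e.g. duplicate vectors): fall back to halving.
        half = len(entries) // 2
        groups = [entries[:half], entries[half:]]
    return Node(groups[0]), Node(groups[1])

class KTree:
    def __init__(self, order, seed=0):
        self.order = order              # assumed maximum entries per node
        self.rng = random.Random(seed)
        self.root = Node([])

    def insert(self, vector):
        result = self._insert(self.root, list(vector))
        if result is not None:
            # The root itself split: grow the tree by one level.
            self.root = Node([[*part.summary(), part] for part in result])

    def _insert(self, node, vector):
        if not node.entries or node.entries[0][2] is None:
            node.entries.append([vector, 1, None])      # leaf: store the vector
        else:
            # Internal node: follow the child with the nearest centroid.
            i = min(range(len(node.entries)),
                    key=lambda j: dist2(node.entries[j][0], vector))
            child = node.entries[i][2]
            result = self._insert(child, vector)
            parts = result if result is not None else (child,)
            # Refresh the centroid, or replace it by two if the child split.
            node.entries[i:i + 1] = [[*p.summary(), p] for p in parts]
        if len(node.entries) > self.order:
            return split(node, self.rng)
        return None

def leaf_depths(node, depth=0):
    """Set of depths at which leaves occur; one element means balanced."""
    if node.entries[0][2] is None:
        return {depth}
    found = set()
    for _, _, child in node.entries:
        found |= leaf_depths(child, depth + 1)
    return found

# Usage: cluster synthetic 2-D points drawn around two well-separated centres.
data_rng = random.Random(0)
points = [[data_rng.gauss(c, 1.0), data_rng.gauss(c, 1.0)]
          for c in (0.0, 20.0) for _ in range(50)]
tree = KTree(order=4, seed=1)
for p in points:
    tree.insert(p)
```

Because each insertion touches only one root-to-leaf path and every level of the tree is itself a clustering of the level below, a sketch like this illustrates both the hierarchy and the low per-document cost the abstract refers to. A real implementation would likely differ in how splits are seeded and how sparse vectors are handled.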
