Exact clustering in linear time

Jonathan A. Marshall; Lawrence C. Rafsky

arXiv:1702.05425 · cs.DS · February 28, 2017

TL;DR

This paper introduces MIMOSA, a class of hash-based algorithms that cluster data exactly in linear time, so large data sets can be clustered without resorting to probabilistic methods or capping the number of clusters.

Contribution

The paper presents MIMOSA, a new class of algorithms that perform exact clustering in linear time by marking and matching partial-signature keys in a hash table, avoiding the quadratic cost of comparing each item with all preceding items.

Findings

01

A MIMOSA implementation clustered a benchmark data set of 10,000,000 news articles by news topic.

02

On the 10,000,000-article benchmark, a MIMOSA implementation finished more than four orders of magnitude faster than a standard centroid implementation.

03

The algorithm provides exact, error-free clustering results.

Abstract

The time complexity of data clustering has been viewed as fundamentally quadratic, slowing with the number of data items, as each item is compared for similarity to preceding items. Clustering of large data sets has been infeasible without resorting to probabilistic methods or to capping the number of clusters. Here we introduce MIMOSA, a novel class of algorithms which achieve linear time computational complexity on clustering tasks. MIMOSA algorithms mark and match partial-signature keys in a hash table to obtain exact, error-free cluster retrieval. Benchmark measurements, on clustering a data set of 10,000,000 news articles by news topic, found that a MIMOSA implementation finished more than four orders of magnitude faster than a standard centroid implementation.
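The abstract's idea of marking and matching partial-signature keys in a hash table can be illustrated with a minimal sketch. This is not taken from the paper: it assumes each item carries a small, fixed-size signature of tokens, that two items belong together when they share at least m signature elements, and that every m-element subset of a signature serves as a partial-signature key. The function name cluster_by_partial_signature and the parameter m are hypothetical.

```python
from itertools import combinations


def cluster_by_partial_signature(signatures, m):
    """Label each item with a cluster id (illustrative sketch, not the paper's code).

    Assumption: an item joins the cluster of an earlier item when the two
    share at least `m` signature elements; otherwise it starts a new cluster.
    """
    table = {}    # partial-signature key -> cluster id that marked it
    labels = []   # cluster id per item, in input order
    next_id = 0

    for sig in signatures:
        # Every m-element subset of the signature is one partial-signature key.
        # Sorting first makes each key a canonical, hashable tuple.
        keys = list(combinations(sorted(set(sig)), m))

        # Match: any key already in the table means an earlier item shares
        # at least m elements with this one.
        cluster = next((table[k] for k in keys if k in table), None)
        if cluster is None:
            cluster = next_id
            next_id += 1

        # Mark: record this item's keys so later items can find it.
        for k in keys:
            table.setdefault(k, cluster)

        labels.append(cluster)

    return labels


articles = [
    ("a", "b", "c", "d"),
    ("a", "b", "c", "x"),   # shares a, b, c with the first item
    ("p", "q", "r", "s"),   # shares nothing with earlier items
    ("b", "c", "d", "y"),   # shares b, c, d with the first item
]
print(cluster_by_partial_signature(articles, m=3))
```

Under these assumptions, each item generates a fixed number of keys (the number of m-element subsets of its signature) and performs only hash-table lookups and inserts, so the expected total work grows linearly with the number of items rather than quadratically. The match is exact in the sense that two signatures sharing at least m elements always share at least one m-element key.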

Code & Models

Repositories

jam-git/mimosa

Taxonomy

Topics: Advanced Clustering Algorithms Research · Complex Network Analysis Techniques · Data Management and Algorithms