$k$-POD: A Method for $k$-Means Clustering of Missing Data

Jocelyn T. Chi; Eric C. Chi; Richard G. Baraniuk

arXiv:1411.7013·stat.CO·June 7, 2018

TL;DR

The paper introduces $k$-POD, an extension of $k$-means clustering that handles missing data without a separate imputation step or external information, even when a large share of the entries is missing.

Contribution

The novel $k$-POD method enables $k$-means clustering directly on incomplete data, addressing limitations of existing imputation-based approaches.

Findings

01. Works with unknown missingness mechanisms

02. Effective with high levels of missing data

03. Avoids costly data imputation

Abstract

The $k$-means algorithm is often used in clustering applications but its usage requires a complete data matrix. Missing data, however, is common in many applications. Mainstream approaches to clustering missing data reduce the missing data problem to a complete data formulation through either deletion or imputation but these solutions may incur significant costs. Our $k$-POD method presents a simple extension of $k$-means clustering for missing data that works even when the missingness mechanism is unknown, when external information is unavailable, and when there is significant missingness in the data.
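
The abstract does not spell out the iteration, so the following is only an illustrative sketch of how $k$-means might be extended to a partially observed matrix. It assumes a fill-and-recluster scheme: missing entries are temporarily filled with the current centroid of the row's cluster, ordinary $k$-means is run on the filled matrix, and the two steps repeat until the fill stops changing. The names `lloyd_kmeans` and `kpod_sketch`, the `init` parameter, the column-mean starting fill, and the stopping rule are all assumptions made for this example, not the authors' reference implementation.

```python
import numpy as np

def lloyd_kmeans(X, centroids, n_iter=50):
    """Plain Lloyd's algorithm on a complete matrix, from given centroids."""
    k = centroids.shape[0]
    for _ in range(n_iter):
        # Squared distance from every row to every centroid: shape (n, k).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                new[j] = members.mean(axis=0)
        converged = np.allclose(new, centroids)
        centroids = new
        if converged:
            break
    return labels, centroids

def kpod_sketch(X, k, init=None, n_outer=100, seed=0):
    """Hypothetical fill-and-recluster loop; NaN marks a missing entry.

    Assumes X is a float array and every column has at least one
    observed value.
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    # Assumed starting fill: column means of the observed entries.
    X_fill = np.where(observed, X, np.nanmean(X, axis=0))
    if init is None:
        rng = np.random.default_rng(seed)
        centroids = X_fill[rng.choice(len(X), size=k, replace=False)]
    else:
        centroids = np.asarray(init, dtype=float).copy()
    for _ in range(n_outer):
        labels, centroids = lloyd_kmeans(X_fill, centroids)
        # Overwrite only the missing entries; observed values never change.
        X_new = np.where(observed, X, centroids[labels])
        done = np.allclose(X_new, X_fill)
        X_fill = X_new
        if done:
            break
    return labels, centroids

if __name__ == "__main__":
    X = np.array([[0.0, 0.0], [0.2, np.nan], [np.nan, 0.1],
                  [10.0, 10.0], [np.nan, 10.2], [9.9, np.nan]])
    labels, centroids = kpod_sketch(X, k=2, init=[[0.0, 0.0], [10.0, 10.0]])
    print(labels)
    print(centroids)
```

In this sketch the observed entries are never altered, so the filled values act only as a working surrogate inside the loop; no row is deleted and no external imputation model is needed. A production version would typically add multiple random restarts, since $k$-means is sensitive to initialization.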
