Regularized k-POD: Sparse k-means clustering for high-dimensional missing data

Xin Guan; Yoshikazu Terada

arXiv:2507.11884·stat.ME·July 17, 2025

Open Access

TL;DR

This paper introduces a regularized k-POD method for clustering high-dimensional data with missing values. Feature-wise regularization on the cluster centers reduces the bias that irrelevant features introduce, while the method keeps the computational efficiency and flexibility of k-POD.

Contribution

It proposes a regularized k-POD approach that mitigates the bias of k-POD in estimating cluster centers for high-dimensional missing data, with the aim of improving clustering accuracy over the existing k-POD method.

Findings

01 Reduces the bias of k-POD cluster-center estimates in high-dimensional missing data.

02 Improves clustering accuracy in simulations and on real-world data.

03 Retains the computational efficiency and flexibility of k-POD.

Abstract

Classical k-means clustering relies on distances computed from all data features and therefore cannot be applied directly to incomplete data with missing values. k-POD, a natural extension of k-means to missing data, uses only the observed entries for clustering and is both computationally efficient and flexible. However, in high-dimensional missing data that include features irrelevant to the underlying cluster structure, those irrelevant features bias the k-POD estimates of the cluster centers, which degrades clustering performance. The existing k-POD method nevertheless performs well in low-dimensional cases, which highlights the importance of addressing this bias. To this end, we propose a regularized k-POD clustering method that adds feature-wise regularization on the cluster centers to the existing k-POD clustering. Such a penalty on cluster…
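To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' algorithm. It alternates a k-means-style assignment and center update that use only the observed entries, then shrinks each feature's column of cluster centers with group soft-thresholding. The shrinkage step, the penalty level `lam`, and every function name here are assumptions chosen for illustration; the abstract does not specify the paper's penalty or solver. The sketch also assumes the data are centered, so that an irrelevant feature has centers near zero.

```python
import numpy as np

def assign_observed(X, mask, centers):
    """Assign each row to the nearest center, measured on observed entries only."""
    Xf = np.where(mask, X, 0.0)  # missing entries replaced by 0, then masked out
    # diff[i, k, j] = x_ij - mu_kj where x_ij is observed, 0 otherwise
    diff = (Xf[:, None, :] - centers[None, :, :]) * mask[:, None, :]
    return np.argmin((diff ** 2).sum(axis=2), axis=1)

def update_centers(X, mask, labels, centers):
    """For each cluster and feature, average the observed entries."""
    new = centers.copy()
    Xf = np.where(mask, X, 0.0)
    for k in range(centers.shape[0]):
        rows = labels == k
        counts = mask[rows].sum(axis=0)
        sums = Xf[rows].sum(axis=0)
        seen = counts > 0  # keep the old value if nothing is observed
        new[k, seen] = sums[seen] / counts[seen]
    return new

def shrink_features(centers, lam):
    """Group soft-thresholding per feature column (assumed, illustrative penalty)."""
    norms = np.linalg.norm(centers, axis=0)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return centers * scale

def sparse_observed_kmeans(X, mask, init_centers, lam, n_iter=20):
    """Alternate assignment, observed-entry center update, and shrinkage."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(n_iter):
        labels = assign_observed(X, mask, centers)
        centers = shrink_features(update_centers(X, mask, labels, centers), lam)
    labels = assign_observed(X, mask, centers)
    return labels, centers
```

With `mask = ~np.isnan(X)`, a feature whose column of centers has a small norm is set exactly to zero and stops influencing the assignments, while a feature that separates the clusters is only shrunk. This mirrors the motivation stated in the abstract, under the assumptions listed above.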

Taxonomy

Topics: Face and Expression Recognition · Advanced Clustering Algorithms Research · Bayesian Methods and Mixture Models