Pooled variable scaling for cluster analysis

Jakob Raymaekers; Ruben H. Zamar

arXiv:1912.10492·stat.ME·July 28, 2020·Bioinform.

TL;DR

This paper introduces a novel pooled variance-based scaling method for cluster analysis that preserves the influence of informative variables, validated through simulations and real genomic data applications.

Contribution

The paper presents a new scaling approach based on pooled variance that improves clustering performance by maintaining the effect of informative variables.

Findings

01

The proposed scaling method outperforms traditional methods in simulations.

02

The method is applied to cluster a high-dimensional genomic dataset of gene expression data from breast cancer tissue specimens.

03

Across the simulation study and well-known real data examples, the method is safe and generally useful.

Abstract

We propose a new approach for scaling variables prior to cluster analysis, based on the concept of pooled variance. Unlike available scaling procedures such as those based on the standard deviation and the range, our proposed scale avoids dampening the beneficial effect of informative clustering variables. We confirm through an extensive simulation study and applications to well-known real data examples that the proposed scaling method is safe and generally useful. Finally, we use our approach to cluster a high-dimensional genomic dataset consisting of gene expression data for several specimens of breast cancer cell tissue.
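To make the contrast with standard deviation scaling concrete, here is a minimal sketch of a pooled (within-group) standard deviation for a single variable. It is an illustration under stated assumptions, not the authors' procedure: the helper names `kmeans_1d` and `pooled_sd` are hypothetical, the groups are found with a plain one-dimensional k-means, and the number of groups `k` is passed in by hand, since this summary does not say how it is chosen.

```python
import numpy as np


def kmeans_1d(x, k, n_iter=100):
    """Lloyd's algorithm on one variable, started at evenly spaced quantiles."""
    x = np.asarray(x, dtype=float)
    centers = np.quantile(x, (np.arange(k) + 0.5) / k)
    labels = np.zeros(x.size, dtype=int)
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        # Recompute each center as its group mean (keep old center if empty).
        new_centers = np.array([
            x[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    return labels, centers


def pooled_sd(x, k):
    """Square root of the within-group sum of squares divided by (n - k)."""
    x = np.asarray(x, dtype=float)
    labels, centers = kmeans_1d(x, k)
    within_ss = np.sum((x - centers[labels]) ** 2)
    return np.sqrt(within_ss / (x.size - k))


# Synthetic data (assumption of this sketch): one variable that separates
# two groups and one pure-noise variable.
rng = np.random.default_rng(0)
informative = np.concatenate([rng.normal(0, 1, 100), rng.normal(10, 1, 100)])
noise = rng.normal(0, 1, 200)

sd_informative = informative.std(ddof=1)      # inflated by the group separation
pooled_informative = pooled_sd(informative, 2)  # reflects within-group spread only

# Spread of each variable after scaling, measured by its range.
range_after_sd = np.ptp(informative / sd_informative)
range_after_pooled = np.ptp(informative / pooled_informative)
range_noise = np.ptp(noise / noise.std(ddof=1))
```

In this toy setting the overall standard deviation of the informative variable is dominated by the gap between the two groups, so dividing by it shrinks that variable to roughly the same spread as the noise variable. Dividing by the pooled value instead leaves the gap between groups large relative to the within-group spread. With `k = 1` the pooled value reduces to the ordinary sample standard deviation.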
