Archetypal solution spaces for clustering gene expression datasets in identification of cancer subtypes
Yuchen Wu, Luke Dicks, David J. Wales

TL;DR
This paper uses energy landscape theory to analyze the solution space of K-means clustering in gene expression datasets, revealing insights into cancer subtype identification and proposing a new diagnostic metric.
Contribution
It introduces a landscape-based approach to understanding K-means clustering performance and proposes a frustration metric as a diagnostic for choosing the appropriate number of cancer subtypes.
Findings
Single-funnel landscape structure indicates correct number of clusters
Frustration metric correlates with clustering effectiveness
Landscape analysis guides better clustering parameter choices
Abstract
Gene expression profiles are essential in identifying different cancer phenotypes. Clustering gene expression datasets can provide accurate identification of cancerous cell lines, but this task is challenging due to the small sample size and high dimensionality. Using the K-means clustering algorithm, we determine the organisation of the solution space for a variety of gene expression datasets using energy landscape theory. The solution space landscapes allow us to understand K-means performance, and guide more effective use when varying common dataset properties: number of features, number of clusters, and cluster distribution. We find that the landscapes have a single-funnelled structure for the appropriate number of clusters, which is lost when the number of clusters deviates from this. We quantify this landscape structure using a frustration metric and show that it may provide a…
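To make the idea of a K-means "solution space" concrete, the sketch below repeatedly runs Lloyd's algorithm from random starting centroids and records the distinct converged cost values (local minima) for several values of K. It is only a rough illustration under stated assumptions: it is not the authors' energy landscape method and it does not compute their frustration metric, neither of which is specified in this summary. The function names (`lloyd`, `sample_minima`, `make_blobs`) and the synthetic three-cluster data are hypothetical stand-ins for a real gene expression dataset.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_cost(points, centroids):
    """K-means cost: sum of squared distances to the nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def lloyd(points, k, rng, max_iter=100):
    """Run Lloyd's algorithm from k randomly chosen data points.

    Returns the cost of the converged solution (a local minimum).
    """
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new == centroids:
            break
        centroids = new
    return kmeans_cost(points, centroids)


def sample_minima(points, k, n_runs=50, seed=0, decimals=6):
    """Collect the distinct converged costs found over n_runs random restarts."""
    rng = random.Random(seed)
    return sorted({round(lloyd(points, k, rng), decimals) for _ in range(n_runs)})


def make_blobs(centres, n_per=20, spread=0.3, seed=1):
    """Synthetic data (an assumption, not gene expression data)."""
    rng = random.Random(seed)
    return [
        tuple(c + rng.gauss(0, spread) for c in centre)
        for centre in centres
        for _ in range(n_per)
    ]


# Three well-separated synthetic clusters in two dimensions.
points = make_blobs([(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)])

# For each K, list the distinct local-minimum costs found by random restarts.
results = {k: sample_minima(points, k) for k in (2, 3, 4)}
for k, minima in results.items():
    print(f"K={k}: {len(minima)} distinct minima, lowest cost {minima[0]:.3f}")
```

Comparing how many distinct minima appear, and how widely their costs spread, for each K gives a crude view of how the solution space changes with the number of clusters. Random restarts only sample minima; a full landscape analysis would also characterise how those minima are connected, which this sketch does not attempt.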
Taxonomy
Topics: Bioinformatics and Genomic Networks · Gene expression and cancer classification · Machine Learning in Bioinformatics
