# Feature Screening in Large Scale Cluster Analysis

**Authors:** Trambak Banerjee, Gourab Mukherjee, Peter Radchenko

arXiv: 1701.02857 · 2017-10-05

## TL;DR

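This paper introduces a fast feature screening method for clustering massive datasets. It builds a clustering tree for each feature using a fusion penalization based convex clustering criterion, computes a clustering score from that tree, and thresholds the scores to discard non-informative features. Uniform non-asymptotic bounds on the scores of noise features and extensive simulation experiments support the approach.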

## Contribution

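The paper presents a screening procedure for clustering problems in which both the number of features and the number of observations can be very large. It also establishes uniform non-asymptotic bounds on the clustering scores of noise features, which imply perfect screening of non-informative features with high probability.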

## Key findings

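- Uniform non-asymptotic bounds on the clustering scores of noise features imply perfect screening of non-informative features with high probability
- Encouraging results in simulation experiments against other screening approaches commonly used in cluster analysis
- Empirical evidence that the method applies to cluster analysis of big datasets from single-cell gene expression studies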

## Abstract

We propose a novel methodology for feature screening in clustering massive datasets, in which both the number of features and the number of observations can potentially be very large. Taking advantage of a fusion penalization based convex clustering criterion, we propose a very fast screening procedure that efficiently discards non-informative features by first computing a clustering score corresponding to the clustering tree constructed for each feature, and then thresholding the resulting values. We provide theoretical support for our approach by establishing uniform non-asymptotic bounds on the clustering scores of the "noise" features. These bounds imply perfect screening of non-informative features with high probability and are derived via careful analysis of the empirical processes corresponding to the clustering trees that are constructed for each of the features by the associated clustering procedure. Through extensive simulation experiments we compare the performance of our proposed method with other screening approaches, popularly used in cluster analysis, and obtain encouraging results. We demonstrate empirically that our method is applicable to cluster analysis of big datasets arising in single-cell gene expression studies.
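The score-then-threshold structure described in the abstract can be illustrated with a minimal sketch. It is not the paper's procedure. The paper derives each score from a clustering tree built with a fusion penalization based convex clustering criterion. The sketch swaps in a much simpler stand-in score: the largest fraction of a feature's variance explained by splitting its sorted values into two contiguous groups. The names `split_score` and `screen_features`, the choice to keep features scoring above the cutoff, and the cutoff value itself are assumptions made for illustration only.

```python
import numpy as np


def split_score(x):
    """Stand-in univariate clustering score (NOT the paper's score).

    Returns the largest fraction of variance explained by any split of the
    sorted values into two contiguous groups. The value lies in [0, 1].
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    total = np.sum((x - x.mean()) ** 2)
    if n < 2 or total == 0.0:
        return 0.0
    csum = np.cumsum(x)
    k = np.arange(1, n)  # size of the left group for each candidate split
    left_mean = csum[:-1] / k
    right_mean = (csum[-1] - csum[:-1]) / (n - k)
    # Between-group sum of squares for a two-group split.
    between = k * (n - k) / n * (left_mean - right_mean) ** 2
    return float(between.max() / total)


def screen_features(X, threshold):
    """Score every column of X independently, then threshold the scores.

    Keeping the features whose score exceeds the threshold is an assumption
    of this sketch. Returns the kept column indices and all scores.
    """
    X = np.asarray(X, dtype=float)
    scores = np.array([split_score(X[:, j]) for j in range(X.shape[1])])
    return np.flatnonzero(scores > threshold), scores
```

Each feature is scored on its own, so the work grows linearly with the number of features and can be run in parallel across columns. With this stand-in score, a single Gaussian noise feature scores roughly 0.64 in large samples, while a feature with two well-separated groups scores closer to 1, so a cutoff between those values separates the two cases in simple synthetic data.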

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1701.02857/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/1701.02857/full.md

## References

58 references — full list in the complete paper: https://tomesphere.com/paper/1701.02857/full.md

---
Source: https://tomesphere.com/paper/1701.02857