Clustering and Classification of Genetic Data Through U-Statistics
Gabriela Bettella Cybis, Marcio Valk, Silvia Regina Costa Lopes

TL;DR
This paper introduces a versatile nonparametric clustering and classification framework based on U-statistics for complex genetic data, with statistical tests for homogeneity and significance, validated through simulations and real datasets.
Contribution
The paper develops a U-statistics based approach for clustering and classification of genetic data, including new tests for group homogeneity and classification significance, suited to data with complex dependence structures.
Findings
The proposed tests effectively assess group homogeneity and classification significance.
Simulation studies demonstrate high power and controlled size of the tests.
Applications to real genetic datasets show the method's versatility and biological relevance.
Abstract
Genetic data are frequently categorical and have complex dependence structures that are not always well understood. For this reason, clustering and classification based on genetic data, while highly relevant, are challenging statistical problems. Here we consider a highly versatile U-statistics based approach built on dissimilarities between pairs of data points for nonparametric clustering. In this work we propose statistical tests to assess group homogeneity, taking multiple testing issues into account, and a clustering algorithm based on dissimilarities within and between groups that greatly speeds up the homogeneity test. We also propose a test to verify the classification significance of a sample in one of two groups. A Monte Carlo simulation study is presented to evaluate the power of the classification test, considering different group sizes and degrees of separation. Size and power of…
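The idea of comparing dissimilarities within and between groups can be illustrated with a minimal sketch. It is not the authors' statistic or test: the Hamming dissimilarity, the contrast `2 * between - within_1 - within_2`, and the permutation p-value are all assumptions chosen for illustration, and every function name below is hypothetical. Each group needs at least two sequences so that a within-group average exists.

```python
import itertools
import random


def hamming(a, b):
    """Proportion of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_within(group):
    """Average dissimilarity over all unordered pairs inside one group."""
    pairs = list(itertools.combinations(group, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


def mean_between(g1, g2):
    """Average dissimilarity over all pairs with one member in each group."""
    return sum(hamming(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))


def separation(g1, g2):
    """Illustrative contrast (an assumption, not the paper's statistic):
    large when between-group dissimilarity exceeds within-group dissimilarity."""
    return 2 * mean_between(g1, g2) - mean_within(g1) - mean_within(g2)


def permutation_pvalue(g1, g2, n_perm=999, seed=0):
    """Permutation p-value for the contrast under random relabelling.
    This stands in for the paper's tests, which are not reproduced here."""
    rng = random.Random(seed)
    observed = separation(g1, g2)
    pooled = list(g1) + list(g2)
    n1 = len(g1)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if separation(pooled[:n1], pooled[n1:]) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy categorical sequences, made up for illustration only.
group_a = ["AAAA", "AAAT", "AATA"]
group_b = ["GGGG", "GGGC", "GGCG"]
contrast = separation(group_a, group_b)
p_value = permutation_pvalue(group_a, group_b)
```

With groups this small, only a handful of distinct relabellings exist, so the p-value cannot become very small; the sketch is meant to show the structure of the computation rather than to give a usable test.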
Taxonomy
Topics: Gene expression and cancer classification · Bioinformatics and Genomic Networks · Bayesian Methods and Mixture Models
