# Estimation and model selection for finite mixtures of Tukey’s g- &-h distributions

**Authors:** Tingting Zhan, Misung Yi, Amy R. Peck, Hallgeir Rui, Inna Chervoneva

PMC · DOI: 10.1007/s11222-025-10596-9 · Statistics and Computing · 2025-03-15

## TL;DR

This paper introduces a flexible statistical model for analyzing cellular protein expression levels in tissues, using a mixture of Tukey’s g- &-h distributions to handle multimodal, skewed, and heavy-tailed data.

## Contribution

The novel contribution is a quantile-based estimation method and a stepwise model selection algorithm for finite mixtures of Tukey’s g- &-h distributions.
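
For orientation, a commonly used parameterization of a single Tukey g-and-h component is given here. It is the standard textbook form, not quoted from the paper, and the symbols $A$ (location), $B > 0$ (scale), $g$ (skewness), and $h \ge 0$ (tail heaviness) are notation assumed for this summary:

$$
X = A + B \,\frac{e^{gZ} - 1}{g}\, e^{hZ^{2}/2}, \qquad Z \sim N(0, 1),
$$

where the factor $(e^{gZ} - 1)/g$ is replaced by its limit $Z$ when $g = 0$. Setting $g = h = 0$ gives $X \sim N(A, B^{2})$, which is how a normal component arises as a special case. Because $X$ is an increasing function of $Z$ when $h \ge 0$, the quantile function is available in closed form even though the density is not, which is what makes quantile-based estimation attractive.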

## Key findings

- The proposed QLMD estimator effectively fits Tukey’s g- &-h mixtures with both Gaussian and non-Gaussian components.
- The method outperforms skew-normal and skew-t mixtures in modeling real protein expression data.
- Parameter estimates from the model are useful predictors of progression-free survival in breast cancer.

## Abstract

A finite mixture of distributions is a popular statistical model, which is especially meaningful when the population of interest may include distinct subpopulations. This work is motivated by the analysis of protein expression levels quantified using immunofluorescence immunohistochemistry assays of human tissues. The distributions of cellular protein expression levels in a tissue often exhibit multimodality, skewness and heavy tails, but there is substantial variability between distributions in different tissues from different subjects, and some of these mixture distributions include components consistent with the assumption of a normal distribution. To accommodate such diversity, we propose a mixture of 4-parameter Tukey’s g- &-h distributions for fitting finite mixtures with both Gaussian and non-Gaussian components. Tukey’s g- &-h distribution is a flexible model that allows a variable degree of skewness and kurtosis in mixture components and includes the normal distribution as a particular case. Since the likelihood of the Tukey’s g- &-h mixtures does not have a closed analytical form, we propose a quantile least Mahalanobis distance (QLMD) estimator for parameters of such mixtures. QLMD is an indirect estimator minimizing the Mahalanobis distance between the sample and model-based quantiles, and its asymptotic properties follow from the general theory of indirect estimation. We have developed a stepwise algorithm to select a parsimonious Tukey’s g- &-h mixture model and implemented all proposed methods in the R package QuantileGH available on CRAN. A simulation study was conducted to evaluate the performance of the Tukey’s g- &-h mixtures and compare it with that of mixtures of skew-normal or skew-t distributions. The Tukey’s g- &-h mixtures were applied to model cellular expressions of Cyclin D1 protein in breast cancer tissues, and the resulting parameter estimates were evaluated as predictors of progression-free survival.
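
The quantile-matching idea in the abstract can be illustrated with a minimal Python sketch for a single component. This is not the QuantileGH implementation and makes several simplifying assumptions. The function names (`gh_transform`, `gh_quantile`, `quadratic_distance`), the parameter names `A`, `B`, `g`, `h`, and the simulated values are all hypothetical. An identity weight matrix stands in for the inverse covariance of the sample quantiles that a Mahalanobis distance would use. No optimizer or mixture is included: the sketch only evaluates the distance at two candidate parameter sets.

```python
import math
import random
from statistics import NormalDist, quantiles

STD_NORMAL = NormalDist()

def gh_transform(z, A=0.0, B=1.0, g=0.0, h=0.0):
    """Map a standard normal value z to a Tukey g-and-h value."""
    skew = z if g == 0 else math.expm1(g * z) / g  # the g -> 0 limit is z
    return A + B * skew * math.exp(h * z * z / 2.0)

def gh_quantile(p, A=0.0, B=1.0, g=0.0, h=0.0):
    """Quantile at probability p; valid because the transform increases in z for h >= 0."""
    return gh_transform(STD_NORMAL.inv_cdf(p), A, B, g, h)

def quadratic_distance(sample_q, model_q, weight):
    """Return r^T W r for r = sample_q - model_q and a weight matrix W (nested lists)."""
    r = [s - m for s, m in zip(sample_q, model_q)]
    n = len(r)
    return sum(r[i] * weight[i][j] * r[j] for i in range(n) for j in range(n))

# Simulated data from one g-and-h component (illustrative values only).
rng = random.Random(0)
true_params = dict(A=1.0, B=2.0, g=0.5, h=0.1)
data = [gh_transform(rng.gauss(0.0, 1.0), **true_params) for _ in range(5000)]

# 19 sample quantiles at probabilities of roughly 0.05, 0.10, ..., 0.95.
probs = [k / 20 for k in range(1, 20)]
sample_q = quantiles(data, n=20)

# Placeholder weight: identity, i.e. ordinary least squares on the quantiles.
identity = [[1.0 if i == j else 0.0 for j in range(19)] for i in range(19)]

# Distance at the generating parameters versus a normal model (g = h = 0).
d_true = quadratic_distance(
    sample_q, [gh_quantile(p, **true_params) for p in probs], identity)
d_normal = quadratic_distance(
    sample_q, [gh_quantile(p, A=1.0, B=2.0) for p in probs], identity)
print(d_true < d_normal)
```

In a full estimator, the distance would be minimized over the parameters with a numerical optimizer, and the identity matrix would be replaced by a weight that reflects the covariance of the sample quantiles. For a mixture, the model quantiles have no closed form and would have to be obtained by numerically inverting the mixture distribution function.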

## Linked entities

- **Proteins:** Cyclin D1 (CCND1)
- **Diseases:** breast cancer (MONDO:0004989)

## Full-text entities

- **Genes:** CCND1 (cyclin D1) [NCBI Gene 595] {aka BCL1, D11S287E, PRAD1, U21B31}
- **Diseases:** breast cancer (MESH:D001943)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11910465/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11910465/full.md

## References

5 references — full list in the complete paper: https://tomesphere.com/paper/PMC11910465/full.md

---
Source: https://tomesphere.com/paper/PMC11910465