# An integrative multiomics random forest framework for robust biomarker discovery

**Authors:** Wei Zhang, Hanchen Huang, Lily Wang, Brian D Lehmann, X Steven Chen

PMC · DOI: 10.1093/gigascience/giaf148 · GigaScience · 2025-12-09

## TL;DR

This paper introduces a random forest method for integrating multiomics data to find robust biomarkers, especially when dependencies across omics layers are nonlinear or interaction-driven.

## Contribution

MRF-IMD is an unsupervised multivariate random forest (MRF) framework that ranks features by inverse minimal depth (IMD) importance to integrate multiomics data.

## Key findings

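- In simulations, MRF-IMD matches sparse partial least squares/canonical correlation analysis under linear settings and outperforms them in nonlinear and interaction-driven settings.
- In TCGA breast invasive carcinoma and colon adenocarcinoma data, MRF-IMD identifies biomarkers enriched for cancer-relevant pathways and yields more robust survival stratification than linear integrators with matched model sizes.
- In a TCGA pan-cancer analysis, MRF-IMD features achieve a higher Adjusted Rand Index than alternatives; in ADNI, the integrative signature improves dementia progression stratification over a published methylation risk score.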

## Abstract

High-throughput technologies now produce a wide array of omics data, from genomic and transcriptomic profiles to epigenomic and proteomic measurements. Integrating multiple omics layers measured on the same samples can reveal cross-layer molecular hubs that single-layer analyses miss. However, many existing integrative methods rely on linear assumptions or univariate feature importance, limiting their ability to capture nonlinear and interaction-driven dependencies across data modalities.

We present an unsupervised, multivariate random forest (MRF) framework with an inverse minimal depth (IMD) importance to prioritize shared biomarkers across omics layers. In each forest, one layer serves as a multivariate response and the other as predictors; IMD summarizes how early a predictor (or, on the response side, the maximal splitting response variable) appears across trees, yielding interpretable, cross-layer feature rankings. We provide two IMD-based selection strategies and introduce an optional IMD power transform to enhance sensitivity to interaction signals. In extensive simulations spanning linear, nonlinear, and interaction regimes, our method matches sparse partial least squares/canonical correlation analysis under linear settings and outperforms them as nonlinearity increases, while adapted univariate ensemble learners (random forest, gradient boosting machine, XGBoost) underperform in the multivariate, unsupervised context. Applied to breast invasive carcinoma and colon adenocarcinoma in The Cancer Genome Atlas (TCGA), MRF-IMD identifies genes, CpGs, and microRNAs enriched for cancer-relevant pathways and yields more robust survival stratification than linear integrators with matched model sizes. In a TCGA pan-cancer analysis, MRF-IMD features achieve a higher Adjusted Rand Index than alternatives and recover coherent tumor-type clusters; in the Alzheimer’s Disease Neuroimaging Initiative (ADNI), the integrative signature improves dementia progression stratification over a published methylation risk score.
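To make the minimal-depth idea concrete, here is a minimal Python sketch, not the authors' implementation. It uses scikit-learn's multi-output `RandomForestRegressor` as a stand-in for a multivariate random forest, with one simulated layer as predictors and another as the multivariate response. The helper names `minimal_depths` and `inverse_minimal_depth`, the scoring rule 1 / (1 + depth) averaged over trees, and the toy data are all assumptions made for illustration; this summary does not state the paper's exact IMD definition, selection strategies, or power transform.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def minimal_depths(tree, n_features):
    """Shallowest depth at which each feature is used to split in one fitted tree.

    Features never used in the tree keep a depth of infinity.
    """
    t = tree.tree_
    depths = np.full(n_features, np.inf)
    stack = [(0, 0)]  # (node_id, depth), starting at the root
    while stack:
        node, depth = stack.pop()
        left, right = t.children_left[node], t.children_right[node]
        if left == right:  # leaf node: both children are -1
            continue
        feat = t.feature[node]
        depths[feat] = min(depths[feat], depth)
        stack.append((left, depth + 1))
        stack.append((right, depth + 1))
    return depths


def inverse_minimal_depth(forest, n_features):
    """Assumed scoring rule: mean over trees of 1 / (1 + minimal depth).

    A feature that splits the root of every tree scores 1; a feature that
    is never used scores 0, because 1 / (1 + inf) evaluates to 0.
    """
    scores = np.zeros(n_features)
    for est in forest.estimators_:
        scores += 1.0 / (1.0 + minimal_depths(est, n_features))
    return scores / len(forest.estimators_)


# Toy data: 20 predictors (one "layer"), 3 responses (the other "layer").
# Only predictor 0 drives the responses, and it does so nonlinearly.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
Y = np.column_stack([X[:, 0] ** 2, np.tanh(2 * X[:, 0]), np.abs(X[:, 0])])
Y = Y + 0.1 * rng.normal(size=Y.shape)

forest = RandomForestRegressor(n_estimators=50, max_depth=4, random_state=0)
forest.fit(X, Y)  # a 2D Y makes this a multi-output forest

imd = inverse_minimal_depth(forest, X.shape[1])
ranking = np.argsort(-imd)  # predictor indices, highest score first
print("top predictors by inverse minimal depth:", ranking[:5])
```

In this sketch a predictor that tends to split near the root gets a score close to 1, while predictors that only appear deep in the trees, or not at all, score near 0. Swapping the roles of `X` and `Y` would give a ranking for the other layer under the same assumed rule.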

MRF-IMD provides a scalable and interpretable framework for multiomics integration that reliably identifies cross-layer biomarkers when nonlinear and interaction-driven dependencies are present. This approach advances robust biomarker discovery beyond the limits of linear integrative methods.

## Linked entities

- **Diseases:** colon adenocarcinoma (MONDO:0002271), dementia (MONDO:0001627)

## Full-text entities

- **Genes:** BRCA1 (BRCA1 DNA repair associated) [NCBI Gene 672] {aka BRCAI, BRCC1, BROVCA1, FANCS, IRIS, PNCA4}
- **Diseases:** dementia (MESH:D003704), COAD (MESH:D029424), cancer (MESH:D009369)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12821379/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12821379/full.md

## References

60 references — full list in the complete paper: https://tomesphere.com/paper/PMC12821379/full.md

---
Source: https://tomesphere.com/paper/PMC12821379