# Statistical modelling of an outcome variable with integrated multi-omics

**Authors:** He Li, Zander Gu, Said el Bouhaddani, Jeanine Houwing-Duistermaat

PMC · DOI: 10.1186/s12859-025-06349-0 · 2025-12-24

## TL;DR

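This paper compares one univariate and two multivariate methods for integrating multi-omics data to model an outcome variable. Measured by RMSE, the multivariate approaches outperform the univariate method in simulations with two normally distributed omics datasets and perform comparably elsewhere.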

## Contribution

The paper describes one univariate and two multivariate methods for integrating multi-omics data in outcome modelling, and evaluates them through simulations and two real data applications.

## Key findings

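- In simulations with two correlated, normally distributed omics datasets, the multivariate methods outperform the univariate method; with one normal and one categorical dataset, they perform comparably.
- Multivariate methods generally perform well, particularly when a slightly higher number of components is used for integration.
- In real data applications (two metabolomics datasets from TwinsUK and a metabolomics-genetic dataset from ORCADES), all methods perform similarly in modelling body mass index.
- Multivariate methods remain effective even with non-normal data, offering a promising alternative to high-dimensional univariate approaches.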

## Abstract

In studies that aim to model the relationship between an outcome variable and multiple omics datasets, it is often desirable to reduce the dimensionality of these datasets or to represent one omics dataset in terms of another. Several approaches exist for this purpose, including univariate methods such as polygenic scores, and multivariate methods. Multivariate approaches offer advantages by producing lower-dimensional integrative scores, capturing joint structures across datasets, and filtering out dataset-specific noise. In this paper, we describe one univariate and two multivariate methods, and evaluate their performance through simulations involving two correlated multivariate normally distributed omics datasets, as well as a combination of one multivariate normal and one fixed categorical dataset.
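The simulation setting described above (two correlated multivariate normal omics datasets plus an outcome) can be illustrated with a minimal sketch. It assumes a shared-latent-variable design, which is one common way to induce correlation between datasets. The abstract does not specify the paper's actual simulation design, so the function name `simulate_two_omics`, the dimensions, and the noise levels are all hypothetical.

```python
import numpy as np


def simulate_two_omics(n=200, p=50, q=30, r=3, noise_sd=1.0, seed=0):
    """Simulate two correlated, normally distributed datasets and an outcome.

    Hypothetical design (not taken from the paper): both datasets are
    driven by the same r latent components T, which makes them correlated,
    and the outcome y is a linear function of T plus noise.
    """
    rng = np.random.default_rng(seed)
    T = rng.normal(size=(n, r))        # joint latent components
    W = rng.normal(size=(p, r))        # loadings for the first dataset
    C = rng.normal(size=(q, r))        # loadings for the second dataset
    X = T @ W.T + noise_sd * rng.normal(size=(n, p))  # first omics dataset
    Y = T @ C.T + noise_sd * rng.normal(size=(n, q))  # second omics dataset
    beta = np.ones(r)                  # assumed effect of each component
    y = T @ beta + rng.normal(size=n)  # outcome variable
    return X, Y, y, T
```

Because the datasets share `T` and differ only in their loadings and independent noise, a method that captures joint structure has something to recover while dataset-specific noise is left to filter out.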

We assess method performance using the root mean squared error (RMSE) when modelling the outcome variable as a function of the reduced omics representations. Multivariate methods generally perform well, particularly when a slightly higher number of components is used for integration. They outperform the univariate method in scenarios involving two normally distributed omics datasets and perform comparably in settings with one normal and one categorical dataset. In real data applications, including two metabolomics datasets from TwinsUK and a metabolomics-genetic dataset from ORCADES, all methods show similar performance in modelling body mass index.
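As a rough illustration of this evaluation, the sketch below reduces one dataset to a few joint components, regresses the outcome on those components, and reports the RMSE on held-out samples. The reduction used here, an SVD of the cross-covariance matrix, is a simple stand-in chosen for brevity and is not one of the paper's methods. The helper names (`fit_joint_weights`, `holdout_rmse`, `toy_data`) and the toy data design are assumptions made for the example.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def fit_joint_weights(X, Y, n_components):
    """Stand-in reduction: leading left singular vectors of X'Y."""
    x_mean = X.mean(axis=0)
    Xc = X - x_mean
    Yc = Y - Y.mean(axis=0)
    U, _, _ = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    return x_mean, U[:, :n_components]


def holdout_rmse(X, Y, y, n_components=3, n_train=150):
    """Fit on the first n_train rows, evaluate RMSE on the remaining rows."""
    X_tr, X_te = X[:n_train], X[n_train:]
    y_tr, y_te = y[:n_train], y[n_train:]
    x_mean, W = fit_joint_weights(X_tr, Y[:n_train], n_components)
    # Low-dimensional scores, with an intercept column for the regression.
    A_tr = np.column_stack([np.ones(len(X_tr)), (X_tr - x_mean) @ W])
    A_te = np.column_stack([np.ones(len(X_te)), (X_te - x_mean) @ W])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    model = rmse(y_te, A_te @ coef)
    # Intercept-only baseline: predict the training mean for everyone.
    baseline = rmse(y_te, np.full(len(y_te), y_tr.mean()))
    return model, baseline


def toy_data(n=200, p=40, q=30, r=3, seed=0):
    """Hypothetical toy data: two datasets sharing r latent components."""
    rng = np.random.default_rng(seed)
    T = rng.normal(size=(n, r))
    X = T @ rng.normal(size=(r, p)) + rng.normal(size=(n, p))
    Y = T @ rng.normal(size=(r, q)) + rng.normal(size=(n, q))
    y = T @ np.ones(r) + rng.normal(size=n)
    return X, Y, y
```

Calling `holdout_rmse(*toy_data())` returns the model RMSE alongside an intercept-only baseline, and varying `n_components` gives a feel for how the number of components used for integration affects the error.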

Multivariate methods provide a valuable framework for summarizing multi-omics datasets into low-dimensional components suitable for outcome modelling. Even in the presence of non-normal data, these methods offer a promising alternative to high-dimensional univariate approaches.

The online version contains supplementary material available at 10.1186/s12859-025-06349-0.

## Full-text entities

- **Genes:** MAF (MAF bZIP transcription factor) [NCBI Gene 4094] {aka AYGRP, CCA4, CTRCT21, c-MAF}, CSF2 (colony stimulating factor 2) [NCBI Gene 1437] {aka CSF, GMCSF}
- **Diseases:** Complex Disease (MESH:D048090), PLS (MESH:D004828)
- **Chemicals:** MUFA (MESH:D005229), Brainshake (-), Val (MESH:D014633), DHA (MESH:C027493)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12859906/full.md

---
Source: https://tomesphere.com/paper/PMC12859906