# A flexible framework for minimal biomarker signature discovery from clinical omics studies without library size normalisation

**Authors:** Daniel Rawlinson, Chenxi Zhou, Myrsini Kaforou, Kim-Anh Lê Cao, Lachlan J. M. Coin, Hisham Hasan

PMC · DOI: 10.1371/journal.pdig.0000780 · PLOS Digital Health · 2025-03-26

## TL;DR

This paper introduces FS-PLS, a method to identify small, accurate biomarker sets for disease prediction without needing library size normalization.

## Contribution

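The paper contributes FS-PLS (Forward Selection-Partial Least Squares), a feature selection method that generates minimal disease signatures from high-dimensional omics data for continuous, binary or multi-class outcomes. Its signatures can be evaluated on unseen samples without direct measurement of library size.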

## Key findings

- FS-PLS generates signatures an order of magnitude smaller than those from commonly used signature discovery methods, with comparable predictive performance.
- Selected features can predict library size, allowing normalization of unseen samples using only a few molecules.
- Minimal gene signatures retain nearly all the accuracy of larger models.

## Abstract

Application of transcriptomics, proteomics and metabolomics technologies to clinical cohorts has uncovered a variety of signatures for predicting disease. Many of these signatures require the full ‘omics data for evaluation on unseen samples, either explicitly or implicitly through library size normalisation. Translation to low-cost point-of-care tests requires development of signatures which measure as few analytes as possible without relying on direct measurement of library size. To achieve this, we have developed a feature selection method (Forward Selection-Partial Least Squares) which generates minimal disease signatures from high-dimensional omics datasets with applicability to continuous, binary or multi-class outcomes. Through extensive benchmarking, we show that FS-PLS has comparable performance to commonly used signature discovery methods while delivering signatures which are an order of magnitude smaller. We show that FS-PLS can be used to select features predictive of library size, and that these features can be used to normalize unseen samples, meaning that the features in the complete model can be measured in isolation for making new predictions. By enabling discovery of small, high-performance signatures, FS-PLS addresses an important impediment for the further development of precision medical care.
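The abstract does not describe the algorithm step by step, so the following is only a minimal sketch of the general idea of forward selection on residuals. It is a plain greedy forward stagewise regression for a continuous outcome, not the authors' FS-PLS implementation. The function name `forward_select`, the `min_gain` stopping rule and the toy feature names are assumptions made for illustration.

```python
from statistics import fmean


def center(values):
    """Subtract the mean so that no intercept term is needed."""
    m = fmean(values)
    return [v - m for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def forward_select(columns, y, max_features=3, min_gain=1e-3):
    """Greedy forward selection on residuals (illustrative only).

    columns: dict mapping a feature name to its per-sample values.
    y: continuous outcome, one value per sample.
    min_gain: hypothetical stopping rule; stop when the best candidate
              explains less than this fraction of the total variance.
    Returns a list of (feature name, coefficient) in selection order.
    """
    centered = {name: center(col) for name, col in columns.items()}
    residual = center(y)
    total_ss = dot(residual, residual)
    selected = []
    chosen = set()

    while len(selected) < max_features:
        best_name, best_gain, best_coef = None, 0.0, 0.0
        for name, x in centered.items():
            if name in chosen:
                continue
            xx = dot(x, x)
            if xx == 0.0:
                continue  # constant feature carries no information
            xr = dot(x, residual)
            gain = xr * xr / xx  # drop in residual sum of squares
            if gain > best_gain:
                best_name, best_gain, best_coef = name, gain, xr / xx
        if best_name is None or best_gain < min_gain * total_ss:
            break
        # Remove the part of the residual explained by the chosen feature.
        x = centered[best_name]
        residual = [r - best_coef * xi for r, xi in zip(residual, x)]
        selected.append((best_name, best_coef))
        chosen.add(best_name)
    return selected


if __name__ == "__main__":
    # Toy data: the outcome depends on g1 and g2 but not on g3.
    toy = {
        "g1": [1, 2, 3, 4, 5, 6],
        "g2": [1, -1, 0, 0, -1, 1],
        "g3": [0, 0, 1, 1, 0, 0],
    }
    outcome = [2 * a + 0.5 * b for a, b in zip(toy["g1"], toy["g2"])]
    print(forward_select(toy, outcome))
```

The stopping rule is what keeps the signature small in this sketch: once no remaining feature explains a meaningful share of the residual, selection ends even if `max_features` has not been reached.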

High-throughput sequencing technologies are widely used in clinical studies to measure expression levels of many thousands of different types of molecules in order to develop improved models for predicting disease state and progression. However, low-cost diagnostic assays can only measure a handful of molecules. We have developed a framework, called FS-PLS, for identifying minimal sets of biomarkers for predicting disease state. Here we show that the minimal gene signatures retain almost all of the accuracy of larger models. Additionally, translation of models developed from high-throughput datasets typically requires correction for the total number of molecules sequenced, referred to as library size. We show that FS-PLS can also be used to obtain reliable predictions of library size from measurement of only a few molecules.
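To make the library size idea concrete, here is a hedged sketch of how a new sample could be normalised when only a few molecules are measured. It assumes a single reference feature and a log-log linear relationship between that feature's count and library size. Both are simplifications chosen for illustration; the paper's actual model, which may use several features, is not reproduced here. All function names and the pseudocount are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def fit_library_size_model(ref_counts, library_sizes, pseudo=1.0):
    """Fit log(library size) against log(reference count + pseudo).

    ref_counts: training counts of one reference feature (assumed input).
    library_sizes: total counts of the same training samples.
    """
    x = [math.log(c + pseudo) for c in ref_counts]
    y = [math.log(s) for s in library_sizes]
    return fit_line(x, y)


def predict_library_size(model, ref_count, pseudo=1.0):
    """Estimate library size for an unseen sample from one count."""
    intercept, slope = model
    return math.exp(intercept + slope * math.log(ref_count + pseudo))


def counts_per_million(count, library_size):
    """Scale a raw count by a (predicted) library size."""
    return count / library_size * 1e6


if __name__ == "__main__":
    # Training cohort: full omics data, so library size is known.
    model = fit_library_size_model(
        ref_counts=[9, 19, 39, 79],
        library_sizes=[10_000, 20_000, 40_000, 80_000],
    )
    # New sample: only the reference and a signature gene are measured.
    estimated = predict_library_size(model, ref_count=49)
    print(counts_per_million(25, estimated))
```

In this sketch the model is fitted once on the training cohort, where full profiles are available, and then applied to samples where only the reference and signature features have been measured.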

## Full-text entities

- **Diseases:** blood cancer (MESH:D019337), AML (MESH:D015470), cancer (MESH:D009369), Myeloma (MESH:D009101), RAPIDS (MESH:C564983), CLL (MESH:D015451), respiratory illness (MESH:D012140), TB (MESH:D014390), Sepsis (MESH:D018805), Bacterial infection (MESH:D001424), Tuberculosis infection (MESH:D014376), ALL (MESH:D054198), COVID-19 (MESH:D000086382), Infection (MESH:D007239), Viral infection (MESH:D014777)
- **Chemicals:** FS (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11942414/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11942414/full.md

## References

49 references — full list in the complete paper: https://tomesphere.com/paper/PMC11942414/full.md

---
Source: https://tomesphere.com/paper/PMC11942414