Normalization and selecting non-differentially expressed genes improve machine learning modelling of cross-platform transcriptomic data

Fei Deng (1), Catherine H Feng (1, 2), Nan Gao (3, 4), Lanjing Zhang (1, 4, 5, 6) ((1) Department of Chemical Biology, Ernest Mario School of Pharmacy, Rutgers University, Piscataway, NJ; (2) Harvard University, Cambridge, MA; (3) Department of Biological Sciences, School of Arts & Sciences, Rutgers University, Newark, NJ; (4) Department of Pharmacology, Physiology, and Neuroscience, New Jersey Medical School, Rutgers University, Newark, NJ; (5) Department of Pathology, Princeton Medical Center, Plainsboro, NJ; (6) Rutgers Cancer Institute of New Jersey, New Brunswick, NJ)

arXiv:2501.14248·q-bio.QM·May 30, 2025

TL;DR

This study demonstrates that selecting non-differentially expressed genes and using nonparametric normalization methods can significantly enhance machine learning classification performance across independent transcriptomic datasets from different platforms.

Contribution

It introduces the use of non-differentially expressed genes for normalization to improve cross-platform ML modeling of transcriptomic data, validated on independent breast cancer datasets.

Findings

01. NDEG-based normalization improves classification accuracy

02. Nonparametric normalization methods outperform parametric ones

03. LOG_QN and LOG_QNZ methods with neural networks yield better results

Abstract

Normalization is a critical step in quantitative analyses of biological processes. Recent work shows that cross-platform integration and normalization enable machine learning (ML) training on RNA microarray and RNA-seq data, but those studies did not use independent datasets. It is therefore unclear how to improve ML modelling performance on independent RNA microarray and RNA-seq datasets. Inspired by the housekeeping genes that are commonly used in experimental biology, this study tests the hypothesis that non-differentially expressed genes (NDEG) may improve normalization of transcriptomic data and, in turn, the cross-platform performance of ML models. Microarray and RNA-seq datasets of the TCGA breast cancer cohort were used as independent training and test datasets, respectively, to classify the molecular subtypes of breast cancer. NDEG (p>0.85) and differentially expressed…
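The abstract's two main ideas, keeping genes with a high p-value (p>0.85) as NDEG and then normalizing expression values, can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' pipeline: the p-values are taken as precomputed by some unspecified statistical test, "LOG_QN" is read as a log2(x + 1) transform followed by quantile normalization, and the gene names, sample names, and function names (`select_ndeg`, `log_quantile_normalize`) are hypothetical.

```python
import math
from statistics import mean


def select_ndeg(p_values, threshold=0.85):
    """Return genes whose p-value exceeds the threshold (candidate NDEG).

    p_values: dict mapping gene name -> precomputed p-value (assumed input).
    """
    return [gene for gene, p in p_values.items() if p > threshold]


def log_quantile_normalize(samples):
    """Apply log2(x + 1), then quantile-normalize across samples.

    samples: dict mapping sample name -> list of non-negative expression
    values; all lists share the same gene order and length.
    Ties are broken by position rather than averaged (a simplification).
    """
    logged = {s: [math.log2(v + 1.0) for v in vals] for s, vals in samples.items()}
    n_genes = len(next(iter(logged.values())))

    # Reference distribution: mean of the k-th smallest value across samples.
    sorted_cols = [sorted(vals) for vals in logged.values()]
    reference = [mean(col[k] for col in sorted_cols) for k in range(n_genes)]

    normalized = {}
    for s, vals in logged.items():
        # Indices of this sample's genes, ordered from lowest to highest value.
        order = sorted(range(n_genes), key=vals.__getitem__)
        out = [0.0] * n_genes
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized[s] = out
    return normalized


# Hypothetical toy data: four genes, one microarray and one RNA-seq sample.
p_values = {"G1": 0.91, "G2": 0.02, "G3": 0.88, "G4": 0.40}
genes = list(p_values)
raw = {
    "array_s1": [1.0, 3.0, 7.0, 15.0],
    "seq_s1": [0.0, 30.0, 2.0, 10.0],
}

ndeg = select_ndeg(p_values)
keep = [genes.index(g) for g in ndeg]
subset = {s: [vals[i] for i in keep] for s, vals in raw.items()}
normalized = log_quantile_normalize(subset)
```

After this step both toy samples share the same value distribution over the selected genes, which is the property that makes quantile-style normalization attractive when training and test data come from different platforms. Whether normalization is applied before or after gene selection in the study is not stated in the text above, so the ordering here is also an assumption.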
