Normalization and selecting non-differentially expressed genes improve machine learning modelling of cross-platform transcriptomic data
Fei Deng (1), Catherine H Feng (1, 2), Nan Gao (3, 4), Lanjing Zhang (1, 4, 5, 6) ((1) Department of Chemical Biology, Ernest Mario School of Pharmacy, Rutgers University, Piscataway, NJ, (2) Harvard University, Cambridge, MA, (3) Department of Biological Sciences

TL;DR
This study demonstrates that selecting non-differentially expressed genes and using nonparametric normalization methods can significantly enhance machine learning classification performance across independent transcriptomic datasets from different platforms.
Contribution
It introduces the use of non-differentially expressed genes for normalization to improve cross-platform ML modeling of transcriptomic data, validated on independent breast cancer datasets.
Findings
NDEG-based normalization improves classification accuracy
Nonparametric normalization methods outperform parametric ones
LOG_QN and LOG_QNZ methods with neural networks yield better results
Abstract
Normalization is a critical step in quantitative analyses of biological processes. Recent work shows that cross-platform integration and normalization enable machine learning (ML) training on RNA microarray and RNA-seq data, but those studies did not use independent datasets. It is therefore unclear how to improve ML modelling performance on independent RNA microarray and RNA-seq datasets. Inspired by the housekeeping genes that are commonly used in experimental biology, this study tests the hypothesis that non-differentially expressed genes (NDEG) may improve normalization of transcriptomic data and, in turn, the cross-platform performance of ML models. Microarray and RNA-seq datasets from the TCGA breast cancer cohort were used as independent training and test datasets, respectively, to classify the molecular subtypes of breast cancer. NDEG (p>0.85) and differentially expressed…
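The two steps named in the abstract, keeping genes with p>0.85 as NDEG and then normalizing, can be illustrated with a minimal sketch. It rests on several assumptions that the visible text does not confirm: the per-gene p-values are taken as precomputed by some differential-expression test, LOG_QN is read as a log2 transform followed by quantile normalization, and the function names `select_ndeg` and `log_quantile_normalize` and all example numbers are hypothetical.

```python
import numpy as np


def select_ndeg(p_values, threshold=0.85):
    """Return indices of genes whose p-value exceeds the threshold.

    Assumption: p_values come from a per-gene differential-expression
    test run elsewhere; the test itself is not specified here.
    """
    p = np.asarray(p_values, dtype=float)
    return np.flatnonzero(p > threshold)


def log_quantile_normalize(expr):
    """Apply log2(x + 1), then quantile-normalize across samples.

    expr is a genes x samples array of non-negative values.
    Assumption: this is one plausible reading of "LOG_QN".
    Ties are not averaged in this simple version.
    """
    logged = np.log2(np.asarray(expr, dtype=float) + 1.0)
    # Sort each sample (column) and remember where each value came from.
    order = np.argsort(logged, axis=0, kind="stable")
    sorted_vals = np.take_along_axis(logged, order, axis=0)
    # Reference distribution: mean of the sorted values at each rank.
    rank_means = sorted_vals.mean(axis=1)
    # Write the rank means back to each gene's original position.
    normalized = np.empty_like(logged)
    np.put_along_axis(normalized, order, rank_means[:, None], axis=0)
    return normalized


# Hypothetical toy data: 5 genes x 2 samples on very different scales.
p_values = [0.90, 0.01, 0.95, 0.30, 0.88]
expr = np.array([[5, 50], [100, 1000], [7, 70], [20, 200], [3, 30]])

ndeg_idx = select_ndeg(p_values)
ndeg_norm = log_quantile_normalize(expr[ndeg_idx])
```

After quantile normalization every sample shares the same set of values, so platform-specific scale differences between the two toy columns disappear while the within-sample gene ranking is preserved.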
