Association of normalization, non-differentially expressed genes and data source with machine learning performance in intra-dataset or cross-dataset modelling of transcriptomic and clinical data

Fei Deng; Lanjing Zhang

arXiv:2502.18888 · q-bio.QM · February 28, 2025

TL;DR

This study investigates how normalization, non-differentially expressed genes, and data source influence machine learning performance in intra- and cross-dataset modeling of transcriptomic and clinical data, revealing key factors affecting model transferability.

Contribution

It provides new insights into the effects of normalization, NDEG, and data source on ML performance across datasets in transcriptomic and clinical data modeling.

Findings

01

Normalization and NDEG improve intra-dataset ML performance.

02

Cross-dataset ML performance is mainly influenced by data source and transcriptomic data.

03

Support vector machine was the most frequently best-performing model.

Abstract

Cross-dataset testing is critical for examining the performance of machine learning (ML) models. However, most studies modelling transcriptomic and clinical data conducted only intra-dataset testing. It is also unclear whether normalization and non-differentially expressed genes (NDEG) can improve the cross-dataset modelling performance of ML. We thus aim to understand whether normalization, NDEG and data source are associated with the performance of ML in cross-dataset testing. The transcriptomic and clinical data shared by the lung adenocarcinoma cases in TCGA and ONCOSG were used. The best cross-dataset ML performance was reached using transcriptomic data alone and was statistically better than that reached using transcriptomic and clinical data. The best balanced accuracy, area under the curve and accuracy were significantly better in ML algorithms trained on TCGA and tested on ONCOSG than those trained on…
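
The abstract compares models by balanced accuracy, area under the curve and accuracy. As a minimal sketch of how these three metrics are commonly computed for a binary classifier scored on an external test set, the block below uses only the Python standard library. The function names, the labels, the scores and the 0.5 decision threshold are all hypothetical and chosen for illustration; none of them come from the paper, and this is not the authors' evaluation code.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally regardless of size."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores for an external test set
# (illustrative values only, not data from the paper).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05]
y_pred = [int(s >= 0.5) for s in scores]  # assumed 0.5 decision threshold

print(f"accuracy          = {accuracy(y_true, y_pred):.3f}")
print(f"balanced accuracy = {balanced_accuracy(y_true, y_pred):.3f}")
print(f"AUC               = {roc_auc(y_true, scores):.3f}")
```

On imbalanced classes, accuracy and balanced accuracy can diverge, as they do slightly in this toy example. That is one common reason to report both alongside a threshold-free measure such as AUC.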
