Pan-cancer gene set discovery via scRNA-seq for optimal deep learning based downstream tasks
Jong Hyun Kim, Jongseong Jang

TL;DR
This study develops a gene set discovery method using scRNA-seq data combined with machine learning to enhance performance in various pan-cancer downstream tasks, outperforming traditional bulk RNA-seq derived sets.
Contribution
It introduces a novel approach integrating scRNA-seq data with hdWGCNA and XGBoost for gene set selection, improving predictive accuracy in cancer genomics.
Findings
scRNA-seq derived gene sets outperform bulk RNA-seq sets in most tasks
DPM1, BAD, FKBP4 identified as key pan-cancer biomarkers
XGBoost-refined hdWGCNA gene set shows higher performance across multiple tasks
Abstract
The application of machine learning to transcriptomics data has led to significant advances in cancer research. However, the high dimensionality and complexity of RNA sequencing (RNA-seq) data pose major challenges in pan-cancer studies. This study hypothesizes that gene sets derived from single-cell RNA sequencing (scRNA-seq) data will outperform those selected using bulk RNA-seq in pan-cancer downstream tasks. We analyzed scRNA-seq data from 181 tumor biopsies across 13 cancer types. High-dimensional weighted gene co-expression network analysis (hdWGCNA) was performed to identify relevant gene sets, which were further refined using XGBoost for feature selection. These gene sets were applied to downstream tasks using TCGA pan-cancer RNA-seq data and compared with six reference gene sets and oncogenes from OncoKB, with evaluation by deep learning models, including multilayer perceptrons…
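The co-expression step mentioned in the abstract can be illustrated with a minimal sketch. WGCNA-style methods build a gene network by raising the absolute pairwise correlation to a soft-thresholding power, a_ij = |cor(g_i, g_j)|^beta, so that weak correlations shrink toward zero. The code below is not the authors' pipeline (which uses hdWGCNA followed by XGBoost); the gene names, expression values, and beta=6 are hypothetical choices made only for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant gene: treat as uncorrelated
    return cov / math.sqrt(vx * vy)

def soft_threshold_adjacency(expr, beta=6):
    """Unsigned WGCNA-style adjacency: a_ij = |cor(g_i, g_j)| ** beta."""
    genes = list(expr)
    adj = {g: {} for g in genes}
    for i, gi in enumerate(genes):
        for gj in genes[i + 1:]:
            a = abs(pearson(expr[gi], expr[gj])) ** beta
            adj[gi][gj] = a
            adj[gj][gi] = a  # adjacency is symmetric
    return adj

def connectivity(adj):
    """Per-gene connectivity: sum of adjacencies to all other genes."""
    return {g: sum(nbrs.values()) for g, nbrs in adj.items()}

# Toy expression matrix: gene -> values across five samples.
# Hypothetical numbers, not data from the paper.
expr = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [2.1, 3.9, 6.2, 8.0, 9.8],
    "GENE_C": [5.0, 1.0, 4.0, 2.0, 3.0],
}

adj = soft_threshold_adjacency(expr, beta=6)
k = connectivity(adj)
ranked = sorted(k, key=k.get, reverse=True)  # most connected genes first
```

In this toy network GENE_A and GENE_B are strongly co-expressed and keep a high adjacency, while GENE_C is nearly disconnected after soft thresholding. In a full analysis, tightly connected genes would be grouped into modules, and a supervised model could then rank the module genes for selection.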
Taxonomy
Topics: Molecular Biology Techniques and Applications · Genetics, Bioinformatics, and Biomedical Research · Gene expression and cancer classification
Methods: Sparse Evolutionary Training · Feature Selection
