Handling highly correlated genes in prediction analysis of genomic studies
Li Xing, Songwan Joun, Kurt Mackay, Mary Lesperance, and Xuekui Zhang

TL;DR
This paper introduces an algorithm that groups highly correlated genes in genomic prediction models. Treating each correlated group as a unit improves robustness and interpretability and helps preserve the biological signal.
Contribution
The novel grouping algorithm effectively handles correlated genes, enhancing prediction accuracy and biomarker discovery in genomic studies.
Findings
Significantly outperforms standard models in phenotype prediction
Improves robustness of feature selection under condition changes
Identifies gene groups as potential biomarkers
Abstract
Background: Selecting feature genes to predict phenotypes is a typical task in the analysis of genomics data. Although many general-purpose algorithms have been developed for prediction, handling highly correlated genes in the prediction model is still not well addressed. High correlation among genes introduces technical problems, such as multicollinearity, that lead to unreliable prediction models. Furthermore, when a causal gene (one whose variants have an actual biological effect on a phenotype) is highly correlated with other genes, most algorithms select the feature gene from the correlated group in a purely data-driven manner. Since the correlation structure among genes can change substantially when conditions change, a prediction model based on incorrectly selected feature genes is unreliable. Therefore, we aim to keep the causal biological signal in the prediction…
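To make the idea of grouping correlated genes concrete, here is a minimal Python sketch. It is not the authors' algorithm, which the truncated abstract above does not specify. It assumes a simple rule: genes whose pairwise Pearson correlation is at least a hypothetical `threshold` are linked, linked genes form a group (connected components), and each group is summarized by the mean of its standardized expression values. The function names, the threshold of 0.9, and the simulated data are all illustrative assumptions.

```python
import numpy as np


def group_correlated_genes(X, threshold=0.9):
    """Group columns of X (samples x genes) by pairwise correlation.

    Illustrative assumption: two genes are linked when their Pearson
    correlation is >= threshold (positive correlation only), and groups
    are the connected components of the resulting graph.
    """
    n_genes = X.shape[1]
    corr = np.corrcoef(X, rowvar=False)
    parent = list(range(n_genes))

    def find(i):
        # Union-find root lookup with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            if corr[i, j] >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    groups = {}
    for g in range(n_genes):
        groups.setdefault(find(g), []).append(g)
    return sorted(groups.values())


def group_representatives(X, groups):
    """Summarize each group by the mean of its z-scored columns."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.column_stack([Z[:, g].mean(axis=1) for g in groups])


# Simulated data: genes 0 and 1 share a common signal, genes 2 and 3 are independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([
    base + 0.05 * rng.normal(size=200),
    base + 0.05 * rng.normal(size=200),
    rng.normal(size=200),
    rng.normal(size=200),
])

groups = group_correlated_genes(X, threshold=0.9)
reps = group_representatives(X, groups)
print(groups, reps.shape)
```

The group-level columns in `reps` could then be passed to any standard prediction model in place of the individual genes, so the model never has to pick one gene out of a nearly collinear set. Whether the paper uses a threshold, a clustering method, or a different group summary is not stated in the text above.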
Taxonomy
Topics: Gene expression and cancer classification · Bioinformatics and Genomic Networks · Single-cell and spatial transcriptomics
