Are Thousands of Samples Really Needed to Generate Robust Gene-List for Prediction of Cancer Outcome?
Royi Jacobovic

TL;DR
This paper questions whether thousands of samples are really needed to generate a robust gene list for predicting cancer outcome, arguing that the required sample size may be overestimated because the underlying model assumptions are violated and the empirical Bayes methodology has limitations.
Contribution
It challenges the conclusions of Ein-Dor (2006) and Zuk (2007) by showing that their key statistical assumptions are inconsistent with additional assumptions such as sparsity and Gaussianity, and that the required sample size may therefore be overestimated.
Findings
Model assumptions are inconsistent with sparsity and Gaussianity.
Empirical Bayes methods fail to detect severe assumption violations.
Overestimation of required sample size may occur due to these issues.
Abstract
The prediction of cancer prognosis and metastatic potential immediately after the initial diagnosis is a major challenge in current clinical research. The relevance of such a signature is clear, as it will free many patients from the agony and toxic side effects associated with the adjuvant chemotherapy automatically and sometimes carelessly prescribed to them. Motivated by this issue, Ein-Dor (2006) and Zuk (2007) presented a Bayesian model which leads to the following conclusion: thousands of samples are needed to generate a robust gene list for predicting outcome. This conclusion rests on certain statistical assumptions. The current work raises doubts over this determination by showing that: (1) these assumptions are not consistent with additional assumptions such as sparsity and Gaussianity; (2) the empirical Bayes methodology which was suggested in order to test the…
Taxonomy
Topics: Gene expression and cancer classification · Bioinformatics and Genomic Networks · Biomedical Text Mining and Ontologies
