Sample size determination for training cancer classifiers from microarray and RNA-seq data
Sandra Safo, Xiao Song, Kevin K. Dobbin

TL;DR
This paper develops and evaluates methods for estimating the necessary sample size to train effective cancer classifiers using high-dimensional microarray and RNA-seq data, emphasizing the importance of pilot data.
Contribution
It introduces a novel sample size method tailored for lasso logistic regression and demonstrates its application using real and simulated pilot data.
Findings
Sample size estimation is feasible with adequate pilot data.
Existing human RNA-seq datasets are generally insufficient as pilot data.
Simulated pilot RNA-seq data can be used effectively for sample size planning.
Abstract
The objective of many high-dimensional microarray and RNA-seq studies is to develop a classifier of cancer patients based on characteristics of their disease. The germinal center B-cell (GCB) classifier study in lymphoma and the National Cancer Institute's Director's Challenge lung (DC-lung) study are two examples. In recent years, such classifiers have often been developed using regularized regression, such as the lasso. A critical question is whether a better classifier can be developed from a larger training set and, if so, how large the training set should be. This paper examines these two questions using an existing sample size method and a novel sample size method developed here specifically for lasso logistic regression. Both methods are based on pilot data. We reexamine the lymphoma and lung cancer data sets to evaluate the sample sizes, and use resampling to assess the…
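To make the idea of pilot-data-based planning concrete, the following is a minimal sketch of a resampling learning curve for an L1-penalized (lasso) logistic regression. It is not the paper's sample size method. It only illustrates how held-out accuracy can be tracked as the training set grows, which is the kind of evidence that shows whether more samples would help. Everything here is an assumption made for illustration: the simulated Gaussian pilot data, the helper names (`simulate_pilot`, `fit_lasso_logistic`, `learning_curve`), and the penalty, step size, and iteration settings.

```python
import math
import random

def soft_threshold(v, t):
    """Proximal operator of the L1 penalty: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clip to avoid overflow in math.exp
    return 1.0 / (1.0 + math.exp(-z))

def fit_lasso_logistic(X, y, lam=0.05, step=0.1, n_iter=200):
    """L1-penalized logistic regression via proximal gradient descent.
    lam, step, and n_iter are illustrative defaults, not tuned values."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            r = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += r / n
            for j in range(p):
                grad_w[j] += r * xi[j] / n
        b -= step * grad_b  # the intercept is not penalized
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b

def accuracy(w, b, X, y):
    hits = sum((b + sum(wj * xj for wj, xj in zip(w, xi)) > 0) == (yi == 1)
               for xi, yi in zip(X, y))
    return hits / len(y)

def simulate_pilot(n, p=20, n_informative=3, shift=1.0, seed=0):
    """Hypothetical pilot data: two balanced classes, a few shifted features."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(shift * label if j < n_informative else 0.0, 1.0)
                  for j in range(p)])
        y.append(label)
    return X, y

def learning_curve(X, y, train_sizes, n_resamples=5, seed=1):
    """Mean held-out accuracy for each training size, over random splits."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    curve = {}
    for m in train_sizes:
        accs = []
        for _ in range(n_resamples):
            rng.shuffle(idx)
            train, test = idx[:m], idx[m:]
            w, b = fit_lasso_logistic([X[i] for i in train],
                                      [y[i] for i in train])
            accs.append(accuracy(w, b, [X[i] for i in test],
                                 [y[i] for i in test]))
        curve[m] = sum(accs) / len(accs)
    return curve

X, y = simulate_pilot(80)
curve = learning_curve(X, y, train_sizes=[10, 20, 40])
for m, acc in curve.items():
    print(m, round(acc, 3))
```

If the curve is still rising at the largest training size the pilot data allow, that suggests a larger training set could yield a better classifier. A flat curve suggests diminishing returns. In practice one would likely use an established lasso implementation and a principled choice of penalty rather than this hand-rolled solver.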
