Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed
Laurent Jacob, Johann Gagnon-Bartsch, Terence P. Speed

TL;DR
This paper introduces methods to correct gene expression data for unwanted variation using control genes and replicates, enabling unsupervised analysis without losing signals of interest, and demonstrates their effectiveness on multiple datasets.
Contribution
The paper presents novel correction techniques that do not require observing the factor of interest, leveraging control genes and replicates to estimate unwanted variation.
Findings
The proposed methods remove unwanted variation while preserving the signal of interest.
The proposed techniques outperform existing correction methods.
The approach is validated on three gene expression datasets.
Abstract
In large-scale gene expression studies, observations are commonly contaminated by unwanted variation factors such as platforms or batches. Not taking this unwanted variation into account when analyzing the data can lead to spurious associations and to missing important signals. When the analysis is unsupervised, e.g., when the goal is to cluster the samples or to build a corrected version of the dataset - as opposed to the study of an observed factor of interest - taking unwanted variation into account can become a difficult task. The unwanted variation factors may be correlated with the unobserved factor of interest, so that correcting for the former can remove the latter if not done carefully. We show how negative control genes and replicate samples can be used to estimate unwanted variation in gene expression, and discuss how this information can be used to correct the…
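The idea of estimating unwanted variation from negative control genes can be illustrated with a minimal sketch. It is not the authors' estimator. It assumes a samples-by-genes matrix of log expression values, a known set of control-gene columns that carry no signal of interest, and a user-chosen number of unwanted factors k. The function names `estimate_unwanted_factors` and `remove_unwanted_variation` are hypothetical, and the SVD-plus-regression step is a naive stand-in chosen for illustration.

```python
import numpy as np

def estimate_unwanted_factors(Y, control_idx, k):
    """Estimate k unwanted-variation factors from negative control genes.

    Y: (n_samples, n_genes) expression matrix (assumed log scale).
    control_idx: column indices of genes assumed unaffected by the factor of interest.
    """
    Yc = Y[:, control_idx]
    Yc = Yc - Yc.mean(axis=0)            # center each control gene
    U, s, _ = np.linalg.svd(Yc, full_matrices=False)
    return U[:, :k] * s[:k]              # (n_samples, k) factor estimates

def remove_unwanted_variation(Y, W):
    """Naive correction: regress every gene on W and subtract the fitted part."""
    Y_centered = Y - Y.mean(axis=0)
    alpha, *_ = np.linalg.lstsq(W, Y_centered, rcond=None)  # (k, n_genes)
    return Y - W @ alpha                 # gene means are preserved

# Toy usage on simulated data (hypothetical numbers, not from the paper).
rng = np.random.default_rng(0)
n, p = 40, 200
batch = np.repeat([-1.0, 1.0], n // 2)[:, None]   # two batches of 20 samples
Y = batch @ rng.normal(size=(1, p)) + 0.1 * rng.normal(size=(n, p))
W = estimate_unwanted_factors(Y, np.arange(50), k=1)
Y_corrected = remove_unwanted_variation(Y, W)
```

Regressing all genes on the estimated factors is the simplest possible correction. As the abstract warns, if the unwanted factors are correlated with the unobserved factor of interest, this naive step can remove the signal along with the noise, which is why the correction has to be done carefully.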
