Model-based clustering with data correction for removing artifacts in gene expression data
William Chad Young, Ka Yee Yeung, Adrian E. Raftery

TL;DR
This paper introduces MCDC, a novel model-based clustering method that detects and corrects artifacts in gene expression data, enhancing data reliability and subsequent analysis accuracy.
Contribution
The paper presents a new data correction method, MCDC, specifically designed to identify and fix artifacts in gene expression datasets from Luminex Bead technology.
Findings
MCDC improves agreement with external benchmarks.
MCDC enhances the quality of downstream analysis.
The method effectively corrects flipped and duplicated gene expression values.
Abstract
The NIH Library of Integrated Network-based Cellular Signatures (LINCS) contains gene expression data from over a million experiments, using Luminex Bead technology. Only 500 colors are used to measure the expression levels of the 1,000 landmark genes, and the data for the resulting pairs of genes are deconvolved. The raw data are sometimes inadequate for reliable deconvolution, leading to artifacts in the final processed data. These include the expression levels of paired genes being flipped or given the same value, and clusters of values that are not at the true expression level. We propose a new method called model-based clustering with data correction (MCDC) that is able to identify and correct these three kinds of artifacts simultaneously. We show that MCDC improves the resulting gene expression data in terms of agreement with external baselines, as well as improving…
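To make the "flipped pair" artifact concrete, the sketch below simulates one gene pair that shares a bead color, swaps the two values in a fraction of the experiments, and then recovers the swaps with a small two-component Gaussian mixture in which the second component is the first one with its coordinates exchanged. This is a simplified illustration under stated assumptions, not the authors' MCDC implementation: it handles only flips (not duplicated values or spurious clusters), and the function names, expression levels (8 and 11), noise scale, and flip rate are all hypothetical choices made for the example.

```python
import numpy as np


def simulate_pair(n=500, flip_rate=0.2, seed=0):
    """Simulate one gene pair; swap the two values in a random subset of rows.

    All numbers here (means 8 and 11, sd 0.3) are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    clean = rng.normal(loc=[8.0, 11.0], scale=0.3, size=(n, 2))
    flipped = rng.random(n) < flip_rate
    observed = clean.copy()
    observed[flipped] = observed[flipped][:, ::-1]  # swap the two genes
    return observed, flipped


def log_gauss(x, mean, cov):
    """Row-wise log density of a 2-D Gaussian."""
    d = x - mean
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum("ij,jk,ik->i", d, inv, d)
    return -0.5 * (quad + logdet + 2.0 * np.log(2.0 * np.pi))


def fit_flip_mixture(x, n_iter=50):
    """EM for a mixture of N(mu, cov) and its coordinate-swapped copy.

    Returns a boolean flip call per row, the estimated unflipped mean,
    and the estimated flip proportion. Assumes most rows are unflipped,
    so the coordinate-wise median is a reasonable starting mean.
    """
    n = x.shape[0]
    swapped = x[:, ::-1]
    mu = np.median(x, axis=0)
    cov = np.eye(2)
    pi = 0.5
    for _ in range(n_iter):
        # E-step: probability that each row was flipped. The swap has unit
        # Jacobian, so a flipped row has density N(swapped row; mu, cov).
        log_flip = np.log(pi) + log_gauss(swapped, mu, cov)
        log_keep = np.log(1.0 - pi) + log_gauss(x, mu, cov)
        r = np.exp(log_flip - np.logaddexp(log_flip, log_keep))
        # M-step: update parameters using the "un-flipped" version of each row.
        mu = ((1.0 - r)[:, None] * x + r[:, None] * swapped).mean(axis=0)
        d0, d1 = x - mu, swapped - mu
        cov = ((d0 * (1.0 - r)[:, None]).T @ d0 + (d1 * r[:, None]).T @ d1) / n
        cov += 1e-6 * np.eye(2)  # small ridge for numerical stability
        pi = float(np.clip(r.mean(), 1e-6, 1.0 - 1e-6))
    return r > 0.5, mu, pi


observed, truth = simulate_pair()
flips, mu, pi = fit_flip_mixture(observed)
corrected = observed.copy()
corrected[flips] = corrected[flips][:, ::-1]  # undo the detected swaps
print("fraction of rows called correctly:", (flips == truth).mean())
```

The design choice worth noting is that both mixture components share one mean and covariance, linked by a known transformation (the swap), so correcting a row is just a matter of applying the inverse transformation to rows assigned to the flipped component.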
