Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment

Marek Gierliński; Christian Cole; Pietà Schofield; Nicholas J. Schurch; Alexander Sherstnev; Vijender Singh; Nicola Wrobel; Karim Gharbi; Gordon Simpson; Tom Owen-Hughes; Mark Blaxter; Geoffrey J. Barton

arXiv:1505.00588·q-bio.GN·July 28, 2015·Bioinform.

TL;DR

This study uses a 48-replicate RNA-seq experiment in yeast to test how well the negative binomial and log-normal distributions describe gene read counts. The observed counts are consistent with both models, and the high replicate number allows strict quality control and screening of bad replicates.

Contribution

The paper tests statistical models for RNA-seq read counts against a high-replicate experiment. Until now, the validity of these models had usually been tested on low-replicate data or simulations.

Findings

01

Gene read counts fit negative binomial and log-normal models

02

Mean-variance relation follows the line of constant dispersion parameter ~0.01

03

High-replicate data enables effective quality control
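
To make the second finding concrete, here is a minimal sketch of what a constant dispersion parameter implies under the negative binomial model. It assumes the standard parameterization used by tools such as edgeR and DESeq, in which the variance is `mean + dispersion * mean**2`. The function names, the simulated gene mean of 1000 and the method-of-moments estimator are illustrative choices, not the paper's own procedure.

```python
import numpy as np


def nb_variance(mean, dispersion):
    """Negative binomial variance for a given mean and dispersion parameter."""
    return mean + dispersion * mean ** 2


def simulate_counts(mean, dispersion, n_replicates, rng):
    """Draw read counts for one gene across replicates.

    NumPy parameterizes the distribution by (n, p); with n = 1/dispersion
    and p = n / (n + mean) the draws have the requested mean and the
    variance returned by nb_variance.
    """
    n = 1.0 / dispersion
    p = n / (n + mean)
    return rng.negative_binomial(n, p, size=n_replicates)


def moment_dispersion(counts):
    """Method-of-moments dispersion estimate: (variance - mean) / mean^2."""
    m = counts.mean()
    v = counts.var(ddof=1)
    return (v - m) / m ** 2


rng = np.random.default_rng(0)

# Hypothetical gene with mean count 1000 and dispersion 0.01 (illustrative values).
counts_48 = simulate_counts(1000.0, 0.01, 48, rng)
print("variance expected from the model:", nb_variance(1000.0, 0.01))
print("dispersion estimated from 48 replicates:", moment_dispersion(counts_48))
```

With a dispersion of 0.01, the quadratic term overtakes the Poisson term once the mean exceeds 100 counts, so for highly expressed genes the coefficient of variation approaches the square root of 0.01, or about 10%. An estimate from a single gene with 48 replicates is still noisy, which is one reason tools of this kind share information across genes.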

Abstract

High-throughput RNA sequencing (RNA-seq) is now the standard method to determine differential gene expression. Identifying differentially expressed genes crucially depends on estimates of read count variability. These estimates are typically based on statistical models such as the negative binomial distribution, which is employed by the tools edgeR, DESeq and cuffdiff. Until now, the validity of these models has usually been tested on either low-replicate RNA-seq data or simulations. A 48-replicate RNA-seq experiment in yeast was performed and data tested against theoretical models. The observed gene read counts were consistent with both log-normal and negative binomial distributions, while the mean-variance relation followed the line of constant dispersion parameter of ~0.01. The high-replicate data also allowed for strict quality control and screening of bad replicates, which can…
