Beyond the E-value: stratified statistics for protein domain prediction

Alejandro Ochoa, John D. Storey, Manuel Llinás, and Mona Singh

arXiv:1409.6384 · q-bio.GN · February 17, 2016 · PLoS Comput. Biol.

TL;DR

This paper introduces stratified statistical methods, specifically stratified q-values and local FDRs (lFDRs), to improve protein domain prediction accuracy over traditional E-value-based approaches, with broader implications for bioinformatics.

Contribution

It develops the first FDR-estimating algorithms for protein domain prediction and demonstrates the superiority of stratified q-values and lFDRs over E-values in this context.

Findings

01 Stratified q-values outperform E-values in domain prediction.

02 Stratified lFDRs outperform q-values for certain domain families.

03 FDR-based thresholds can significantly improve prediction accuracy.

Abstract

E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for "stratified" multiple hypothesis testing problems, controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain…
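
To make "stratifying statistical tests by domain family" concrete, here is a minimal sketch that computes q-values separately within each family rather than over all hits pooled together. It is not the paper's algorithm: it uses a plain Benjamini-Hochberg-style q-value (which assumes the null proportion is 1), and the function names `bh_qvalues` and `stratified_qvalues`, the `(family, p-value)` input layout, and the example numbers are all hypothetical.

```python
from collections import defaultdict


def bh_qvalues(pvalues):
    """Benjamini-Hochberg-style q-values (assumption: null proportion pi0 = 1)."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum
    # of p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def stratified_qvalues(hits):
    """Compute q-values within each stratum (here: domain family).

    hits: list of (family, pvalue) pairs (hypothetical input layout).
    Returns a list of q-values in the same order as `hits`.
    """
    by_family = defaultdict(list)
    for idx, (family, p) in enumerate(hits):
        by_family[family].append((idx, p))

    out = [None] * len(hits)
    for members in by_family.values():
        qs = bh_qvalues([p for _, p in members])
        for (idx, _), q in zip(members, qs):
            out[idx] = q
    return out


if __name__ == "__main__":
    # Made-up p-values for two made-up families "A" and "B".
    hits = [("A", 0.01), ("A", 0.04), ("B", 0.03), ("A", 0.5)]
    print("stratified:", stratified_qvalues(hits))
    print("pooled:    ", bh_qvalues([p for _, p in hits]))
```

Comparing the two printed lists shows how the same hit can receive a different q-value depending on whether its family is tested on its own or pooled with every other family. The abstract's result concerns per-stratum lFDR thresholds, which additionally require density estimates that this sketch does not attempt.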
