TL;DR
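This paper introduces stratified statistical methods, specifically stratified q-values and local false discovery rates (lFDRs), and shows that they predict protein domains more accurately than traditional E-value thresholds. The underlying result is stated for stratified multiple hypothesis testing in general, so it may apply to other problems in bioinformatics.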
Contribution
The paper develops the first FDR-estimating algorithms for protein domain prediction and shows that thresholds based on stratified q-values and lFDRs outperform E-value thresholds in this setting.
Findings
Stratified q-values outperform E-values in domain prediction.
Stratified lFDRs outperform q-values for certain domain families.
FDR-based thresholds can significantly improve prediction accuracy.
Abstract
E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for "stratified" multiple hypothesis testing problems, controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain…
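To make the idea of stratifying by domain family concrete, here is a minimal sketch that computes q-values separately within each stratum. It is not the paper's algorithm: it assumes each hit already comes with a p-value, it uses the standard Benjamini-Hochberg step-up estimate (which conservatively assumes every null hypothesis is true), and the function names and family labels are hypothetical.

```python
from collections import defaultdict


def bh_qvalues(pvalues):
    """Benjamini-Hochberg q-values for one list of p-values.

    Assumption: the proportion of true nulls is taken to be 1,
    which makes the estimate conservative.
    """
    m = len(pvalues)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def stratified_qvalues(hits):
    """Compute q-values within each stratum (here: domain family).

    hits: list of (family, pvalue) pairs; the family labels are
    hypothetical placeholders for this sketch.
    Returns q-values in the same order as the input hits.
    """
    by_family = defaultdict(list)
    for idx, (family, p) in enumerate(hits):
        by_family[family].append((idx, p))

    out = [None] * len(hits)
    for members in by_family.values():
        qs = bh_qvalues([p for _, p in members])
        for (idx, _), q in zip(members, qs):
            out[idx] = q
    return out


if __name__ == "__main__":
    example_hits = [("famA", 0.01), ("famB", 0.02), ("famA", 0.5), ("famB", 0.03)]
    for (family, p), q in zip(example_hits, stratified_qvalues(example_hits)):
        print(family, p, q)
```

Because each family is corrected on its own, a hit is judged against the other candidates for the same family rather than against all hits pooled together, which is the sense of "stratified" used in the abstract.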
