Biases in the Experimental Annotations of Protein Function and their Effect on Our Understanding of Protein Function Space
Alexandra M. Schnoes, David C. Ream, Alexander W. Thorman, Patricia C. Babbitt, Iddo Friedberg

TL;DR
This paper analyzes how a small number of high-throughput studies dominate protein function annotations, leading to biases that affect our understanding of protein functions and their distribution across proteins.
Contribution
It quantifies the extent of bias caused by dominant studies and characterizes the nature of functional annotation biases in protein databases.
Findings
0.14% of articles provide 25% of protein annotations.
High-throughput experiments mainly provide subcellular location and developmental pathway information.
Annotations from high-throughput studies are less specific than low-throughput ones.
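The first finding is a coverage statistic: rank articles by how many proteins they annotate, then ask what fraction of articles is needed to cover a given share of all annotated proteins. The sketch below illustrates that calculation on made-up data. The function name `fraction_of_articles_for_share`, the article IDs, and the protein IDs are all hypothetical, and this is not necessarily the procedure the authors used; it simply ranks articles by size and accumulates distinct proteins.

```python
from typing import Dict, Set


def fraction_of_articles_for_share(
    article_to_proteins: Dict[str, Set[str]], share: float = 0.25
) -> float:
    """Fraction of articles, taken largest first, needed to cover
    at least `share` of all distinct annotated proteins."""
    if not article_to_proteins:
        raise ValueError("need at least one article")
    all_proteins = set().union(*article_to_proteins.values())
    target = share * len(all_proteins)
    # Rank articles by the number of proteins they annotate, largest first.
    ranked = sorted(
        article_to_proteins.values(), key=len, reverse=True
    )
    covered: Set[str] = set()
    for used, proteins in enumerate(ranked, start=1):
        covered |= proteins
        if len(covered) >= target:
            return used / len(ranked)
    return 1.0


# Toy data (assumption): one high-throughput article annotating 40 proteins
# and 19 low-throughput articles annotating 2 distinct proteins each.
toy = {"HT1": {f"P{i}" for i in range(40)}}
for k in range(19):
    toy[f"LT{k}"] = {f"Q{2 * k}", f"Q{2 * k + 1}"}

quarter = fraction_of_articles_for_share(toy, share=0.25)
everything = fraction_of_articles_for_share(toy, share=1.0)
print(f"{quarter:.2%} of articles cover 25% of proteins")
```

In this toy set a single article out of 20 already covers more than a quarter of the 78 proteins, which is the same kind of skew the paper reports at a much larger scale.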
Abstract
The ongoing functional annotation of proteins relies upon the work of curators to capture experimental findings from scientific literature and apply them to protein sequence and structure data. However, with the increasing use of high-throughput experimental assays, a small number of experimental studies dominate the functional protein annotations collected in databases. Here we investigate just how prevalent is the "few articles -- many proteins" phenomenon. We examine the experimentally validated annotation of proteins provided by several groups in the GO Consortium, and show that the distribution of proteins per published study is exponential, with 0.14% of articles providing the source of annotations for 25% of the proteins in the UniProt-GOA compilation. Since each of the dominant articles describes the use of an assay that can find only one function or a small group of functions,…
