An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB
Michael J. Bell, Colin S. Gillespie, Daniel Swan, Phillip Lord

TL;DR
This study investigates the quality of biological annotations in UniProtKB by analyzing word reuse patterns and applying Zipf's Law, providing a potential metric for assessing annotation reliability over time.
Contribution
It introduces a novel approach using power-law distributions of word reuse to evaluate annotation quality and distinguishes between manual and automated annotations in UniProtKB.
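To make the idea of word reuse following a power law concrete, here is a minimal sketch and not necessarily the authors' procedure. It counts word occurrences across a set of annotation strings and estimates a power-law exponent with the standard approximate maximum-likelihood estimator for discrete data. The function names `word_counts` and `estimate_alpha`, the tokenisation rule, and the default `x_min` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def word_counts(annotations):
    """Count how often each lowercased word occurs across all annotation strings.

    Tokenisation (runs of letters and digits) is an illustrative assumption.
    """
    counts = Counter()
    for text in annotations:
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts


def estimate_alpha(frequencies, x_min=1):
    """Approximate the exponent alpha of a discrete power law p(x) ~ x^(-alpha).

    Uses the approximation alpha = 1 + n / sum(ln(x_i / (x_min - 0.5)))
    over all values x_i >= x_min. The approximation is known to be less
    accurate for small x_min, so treat results as rough.
    """
    tail = [x for x in frequencies if x >= x_min]
    if not tail:
        raise ValueError("no frequencies at or above x_min")
    log_sum = sum(math.log(x / (x_min - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


# Toy, made-up annotation strings (not taken from UniProtKB).
toy_annotations = ["Binds ATP", "binds DNA"]
counts = word_counts(toy_annotations)
alpha = estimate_alpha(counts.values(), x_min=1)
```

On a real corpus, the same two steps could be run separately on each database release, and the resulting exponents compared across releases to look for a trend over time.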
Findings
Clear trends in annotation quality over time.
Distinction between manual and automated annotations.
Potential for a generic quality assessment metric.
Abstract
Motivation: Annotations are a key feature of many biological databases, used to convey our knowledge of a sequence to the reader. Ideally, annotations are curated manually; however, manual curation is costly and time-consuming, and requires expert knowledge and training. Given these issues and the exponential increase of data, many databases implement automated annotation pipelines in an attempt to avoid un-annotated entries. Both manual and automated annotations vary in quality between databases and annotators, making assessment of annotation reliability problematic for users. The community lacks a generic measure for determining annotation quality and correctness, which we look at addressing within this article. Specifically, we investigate word reuse within bulk textual annotations and relate this to Zipf's Principle of Least Effort. We use UniProt Knowledge Base (UniProtKB) as a case…
Taxonomy
Topics: Biomedical Text Mining and Ontologies
