Microtask crowdsourcing for disease mention annotation in PubMed abstracts
Benjamin M. Good, Max Nanis, Andrew I. Su

TL;DR
This study demonstrates that microtask crowdsourcing via Amazon Mechanical Turk can efficiently produce high-quality disease mention annotations in biomedical literature, matching expert standards and enabling scalable corpus creation.
Contribution
The paper introduces a refined crowdsourcing protocol that achieves high annotation accuracy for disease mentions in PubMed abstracts, validated against a gold standard corpus.
Findings
Achieved an F measure of 0.872 against the gold standard.
Annotations can be tuned for higher precision or recall.
Cost-effective annotation at $0.06 per abstract per worker.
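The findings above can be illustrated with a minimal sketch. It assumes that worker annotations are aggregated by simple vote counting over exact character spans, which is one common approach and not necessarily the protocol used in the paper. The function names, the toy spans, and the number of workers per abstract are all hypothetical. Only the $0.06 rate comes from the summary.

```python
from collections import Counter

def aggregate(worker_spans, min_votes):
    """Keep each span marked by at least min_votes workers (assumed voting rule)."""
    votes = Counter(span for spans in worker_spans for span in set(spans))
    return {span for span, n in votes.items() if n >= min_votes}

def precision_recall_f(predicted, gold):
    """Exact-match precision, recall, and balanced F measure."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f

def total_cost(n_abstracts, workers_per_abstract, rate=0.06):
    """Cost in dollars at a flat rate per abstract per worker."""
    return n_abstracts * workers_per_abstract * rate

# Hypothetical toy data: spans are (start, end) character offsets.
gold = {(0, 10), (25, 40)}
workers = [
    [(0, 10), (25, 40)],
    [(0, 10), (50, 60)],
    [(0, 10), (25, 40)],
    [(0, 10)],
]

# Raising the vote threshold trades recall for precision.
for k in (1, 2, 3):
    p, r, f = precision_recall_f(aggregate(workers, k), gold)
    print(f"min_votes={k}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

In this toy example a low threshold keeps a spurious span (lower precision), while a high threshold drops a true span that only two workers marked (lower recall). This is the kind of tuning the second finding describes.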
Abstract
Identifying concepts and relationships in biomedical text enables knowledge to be applied in computational analyses. Many biological natural language processing (BioNLP) projects attempt to address this challenge, but the state of the art in BioNLP still leaves much room for improvement. Progress in BioNLP research depends on large, annotated corpora for evaluating information extraction systems and training machine learning models. Traditionally, such corpora are created by small numbers of expert annotators often working over extended periods of time. Recent studies have shown that workers on microtask crowdsourcing platforms such as Amazon's Mechanical Turk (AMT) can, in aggregate, generate high-quality annotations of biomedical text. Here, we investigated the use of the AMT in capturing disease mentions in PubMed abstracts. We used the NCBI Disease corpus as a gold standard for…
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Biomedical Text Mining and Ontologies · Topic Modeling
