Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup
Alexander S. Yeh, Lynette Hirschman, Alexander A. Morgan

TL;DR
This paper evaluates the effectiveness of text mining techniques in assisting biological literature curation through a challenge-based assessment, highlighting current capabilities and challenges in automating gene-related data extraction.
Contribution
It presents a large-scale evaluation of text mining methods for gene curation, providing insights into their maturity and effectiveness in a real-world genomics context.
Findings
Top systems achieved significant accuracy in identifying relevant articles
Evaluation results highlight strengths and limitations of current text mining approaches
The challenge framework facilitates benchmarking and progress tracking in bioinformatics text mining
Abstract
MOTIVATION: The biological literature is a major repository of knowledge. Many biological databases draw much of their content from a careful curation of this literature. However, as the volume of literature increases, the burden of curation increases. Text mining may provide useful tools to assist in the curation process. To date, the lack of standards has made it impossible to determine whether text mining techniques are sufficiently mature to be useful. RESULTS: We report on a Challenge Evaluation task that we created for the Knowledge Discovery and Data Mining (KDD) Challenge Cup. We provided a training corpus of 862 articles consisting of journal articles curated in FlyBase, along with the associated lists of genes and gene products, as well as the relevant data fields from FlyBase. For the test, we provided a corpus of 213 new ('blind') articles; the 18 participating groups…
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Semantic Web and Ontologies · Scientific Computing and Data Management
