Extracting Information in a Low-resource Setting: Case Study on Bioinformatics Workflows
Clémence Sebe, Sarah Cohen-Boulakia, Olivier Ferret, Aurélie Névéol

TL;DR
This paper explores low-resource methods for extracting detailed bioinformatics workflow information from scientific articles. It shows that, with a tailored corpus, a SciBERT-based NER model reaches a 70.4 F-measure, comparable to inter-annotator agreement.
Contribution
The study introduces BioToFlow, a new annotated corpus, and evaluates multiple low-resource extraction strategies for bioinformatics workflows, including few-shot NER and knowledge-integrated NER models.
Findings
The BioToFlow corpus contains 52 articles annotated with 16 entities.
A SciBERT-based NER model achieved a 70.4 F-measure, comparable to inter-annotator agreement.
Knowledge integration improved extraction performance for specific entities.
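For readers unfamiliar with the metric, here is a minimal sketch of how an entity-level F-measure is commonly computed for NER. It assumes exact-match scoring on (start, end, label) spans, which this summary does not confirm as the paper's evaluation protocol. The function name `entity_f_measure`, the labels `Tool` and `Data`, and the toy spans are hypothetical and chosen only for illustration.

```python
from typing import Set, Tuple

# A span is (start offset, end offset, entity label).
# The labels used below are illustrative, not taken from BioToFlow.
Span = Tuple[int, int, str]


def entity_f_measure(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match span scoring (an assumption)."""
    # A prediction counts only if its boundaries and label both match a gold span.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: three gold entities, four predictions, two exact matches.
gold = {(0, 6, "Tool"), (10, 18, "Data"), (25, 31, "Tool")}
predicted = {(0, 6, "Tool"), (10, 18, "Tool"), (25, 31, "Tool"), (40, 44, "Data")}

precision, recall, f1 = entity_f_measure(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case the span at offsets 10 to 18 has correct boundaries but the wrong label, so it counts against both precision and recall. Applying the same function to two annotators' spans is one way to obtain an agreement figure on the same scale as a model's score.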
Abstract
Bioinformatics workflows are essential for complex biological data analyses and are often described in scientific articles with source code in public repositories. Extracting detailed workflow information from articles can improve accessibility and reusability but is hindered by limited annotated corpora. To address this, we framed the problem as a low-resource extraction task and tested four strategies: 1) creating a tailored annotated corpus, 2) few-shot named-entity recognition (NER) with an autoregressive language model, 3) NER using masked language models with existing and new corpora, and 4) integrating workflow knowledge into NER models. Using BioToFlow, a new corpus of 52 articles annotated with 16 entities, a SciBERT-based NER model achieved a 70.4 F-measure, comparable to inter-annotator agreement. While knowledge integration improved performance for specific entities, it was…
Taxonomy
Topics: Scientific Computing and Data Management
