Relation Extraction in underexplored biomedical domains: A diversity-optimised sampling and synthetic data generation approach
Maxime Delmas, Magdalena Wysocka, André Freitas

TL;DR
This paper addresses the challenge of relation extraction in underexplored biomedical domains by developing a diversity-optimized sampling method, creating synthetic data with large language models, and demonstrating improved model performance.
Contribution
It introduces the GME-sampler for balanced and diverse data selection, and a workflow for generating synthetic training data using open large language models for relation extraction.
Findings
Synthetic data improves model performance over noisy real data.
The BioGPT-Large model achieved an F1-score of 59.0 on the task.
Diversity-optimized sampling enhances dataset quality for relation extraction.
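For readers unfamiliar with how an F1-score is usually computed for relation extraction, the following is a minimal sketch. It assumes relations are compared as exact (organism, compound) pairs and that scores are pooled over all pairs; the paper's exact matching and averaging rules are not stated on this page, so `relation_f1` and the placeholder pairs are purely illustrative.

```python
def relation_f1(gold, predicted):
    """Precision, recall and F1 over sets of (organism, compound) pairs.

    Assumption: a predicted relation counts as correct only if it matches
    a gold pair exactly. Returns values on a 0-1 scale.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy placeholder data, not taken from the paper.
gold_pairs = {
    ("organism_A", "compound_1"),
    ("organism_A", "compound_2"),
    ("organism_B", "compound_3"),
}
predicted_pairs = {
    ("organism_A", "compound_1"),
    ("organism_B", "compound_3"),
    ("organism_B", "compound_4"),
    ("organism_C", "compound_5"),
}

precision, recall, f1 = relation_f1(gold_pairs, predicted_pairs)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

A score such as 59.0 would correspond to 0.59 on the 0-1 scale used in this sketch.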
Abstract
The sparsity of labelled data is an obstacle to the development of Relation Extraction models and the completion of databases in various biomedical areas. While being of high interest in drug-discovery, the natural-products literature, reporting the identification of potential bioactive compounds from organisms, is a concrete example of such an overlooked topic. To mark the start of this new task, we created the first curated evaluation dataset and extracted literature items from the LOTUS database to build training sets. To this end, we developed a new sampler inspired by diversity metrics in ecology, named Greedy Maximum Entropy sampler, or GME-sampler (https://github.com/idiap/gme-sampler). The strategic optimization of both balance and diversity of the selected items in the evaluation set is important given the resource-intensive nature of manual curation. After quantifying the…
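The idea behind a greedy maximum-entropy sampler can be sketched as follows. This is not the authors' implementation (that lives in the linked gme-sampler repository); it is a toy illustration that assumes each literature item carries a list of category labels (for example, the organisms it reports) and that diversity is measured with Shannon entropy, a common ecological diversity index. The function names and the data are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy (natural log) of a Counter of category counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)


def greedy_max_entropy_sample(items, n):
    """Greedily pick up to n items so that the entropy of the pooled labels is maximised.

    items: dict mapping an item id to a list of category labels.
    Ties are broken by taking the first id in sorted order.
    """
    selected = []
    counts = Counter()
    remaining = dict(items)
    while remaining and len(selected) < n:
        best_id, best_entropy = None, -1.0
        for item_id in sorted(remaining):
            # Entropy of the label distribution if this item were added.
            trial = counts + Counter(remaining[item_id])
            entropy = shannon_entropy(trial)
            if entropy > best_entropy:
                best_id, best_entropy = item_id, entropy
        selected.append(best_id)
        counts.update(remaining.pop(best_id))
    return selected, shannon_entropy(counts)


# Toy data: five articles, each labelled with the kingdom of the organism it reports.
articles = {
    "a1": ["plant"],
    "a2": ["plant"],
    "a3": ["fungus"],
    "a4": ["bacterium"],
    "a5": ["plant"],
}

chosen, entropy = greedy_max_entropy_sample(articles, 3)
print(chosen, round(entropy, 4))
```

On this toy input the sampler skips the redundant "plant" articles and returns one article per category, which is the balanced selection a diversity-optimised sampler is meant to favour. Each greedy step rescans all remaining items, so the sketch is quadratic in the number of items and is only meant to convey the principle.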
Code & Models
- 🤗 mdelmas/BioGPT-Large-Natural-Products-RE-Diversity-synt-v1.0 (model) · 9 dl · ♡ 1
- 🤗 mdelmas/BioGPT-Large-Natural-Products-RE-Extended-synt-v1.0 (model) · 7 dl
- 🤗 mdelmas/biogpt-Natural-Products-RE-Diversity-synt-v1.0 (model) · 5 dl
- 🤗 mdelmas/biogpt-Natural-Products-RE-Extended-synt-v1.0 (model) · 6 dl
- 🤗 mdelmas/BioGPT-Large-Natural-Products-RE-Diversity-1000-synt-v1.1 (model) · 5 dl
- 🤗 mdelmas/BioMistral-7B-Natural-Products-RE-Diversity-1000-synt-v1.2 (model) · 7 dl
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Microbial Natural Products and Biosynthesis
Methods: Sparse Evolutionary Training
