Biomed-Enriched: A Biomedical Dataset Enriched with LLMs for Pretraining and Extracting Rare and Hidden Content

Rian Touchent; Nathan Godey; Eric de la Clergerie

arXiv:2506.20331·cs.CL·June 26, 2025

TL;DR

Biomed-Enriched is a large biomedical text dataset built by annotating PubMed paragraphs with an LLM and propagating those labels across the full corpus with a smaller model. It supports biomedical NLP pretraining and makes clinical content easier to access.

Contribution

The paper introduces a two-stage annotation process: an LLM labels a sample of paragraphs, and a smaller fine-tuned model propagates those labels to the full corpus. The result is a large biomedical dataset with refined subsets for NLP applications.

Findings

1. Clinical upsampling improves model performance by ~5%.
2. Educational quality filtering enhances QA task accuracy by ~1%.
3. Curated subsets enable faster training convergence.

Abstract

We introduce Biomed-Enriched, a biomedical text dataset constructed from PubMed via a two-stage annotation process. In the first stage, a large language model annotates 400K paragraphs from PubMed scientific articles, assigning scores for their type (review, study, clinical case, other), domain (clinical, biomedical, other), and educational quality. The educational quality score (rated 1 to 5) estimates how useful a paragraph is for college-level learning. These annotations are then used to fine-tune a small language model, which propagates the labels across the full PMC-OA corpus. The resulting metadata allows us to extract refined subsets, including 2M clinical case paragraphs with over 450K high-quality ones from articles with commercial-use licenses, and to construct several variants via quality filtering and domain upsampling. Clinical text is typically difficult to access due to…
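The subset extraction described in the abstract (quality filtering, domain upsampling, clinical case selection) can be sketched as below. This is a minimal illustration and not the authors' code: the `Paragraph` record, the function names, the quality threshold, and the upsampling factor are all hypothetical, and the three-item corpus is toy data. Only the label vocabulary (type, domain, 1 to 5 educational quality) comes from the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Paragraph:
    """Hypothetical record for one annotated paragraph."""
    text: str
    doc_type: str      # "review", "study", "clinical case", or "other"
    domain: str        # "clinical", "biomedical", or "other"
    edu_quality: int   # educational quality score, 1 (low) to 5 (high)


def filter_by_quality(paragraphs: List[Paragraph], min_quality: int) -> List[Paragraph]:
    """Keep paragraphs whose educational quality is at least min_quality."""
    return [p for p in paragraphs if p.edu_quality >= min_quality]


def upsample_domain(paragraphs: List[Paragraph], domain: str, factor: int) -> List[Paragraph]:
    """Repeat each paragraph of the given domain `factor` times; keep all others once."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    out: List[Paragraph] = []
    for p in paragraphs:
        out.extend([p] * (factor if p.domain == domain else 1))
    return out


def select_clinical_cases(paragraphs: List[Paragraph], min_quality: int) -> List[Paragraph]:
    """Keep clinical case paragraphs that meet the quality threshold."""
    return [
        p for p in paragraphs
        if p.doc_type == "clinical case" and p.edu_quality >= min_quality
    ]


# Toy corpus with placeholder text (not real data).
corpus = [
    Paragraph("toy clinical case paragraph", "clinical case", "clinical", 4),
    Paragraph("toy review paragraph", "review", "biomedical", 3),
    Paragraph("toy boilerplate paragraph", "other", "other", 1),
]

high_quality = filter_by_quality(corpus, min_quality=3)         # threshold is an assumption
clinical_heavy = upsample_domain(corpus, "clinical", factor=3)  # factor is an assumption
cases = select_clinical_cases(corpus, min_quality=4)
```

Repeating records is only one way to upsample; sampling with domain-dependent weights at training time would have a similar effect without duplicating data on disk.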

Taxonomy

Topics: Biomedical Text Mining and Ontologies