# BioTriplex: a full-text annotated corpus for fine-tuning language models in gene-disease relation extraction tasks

**Authors:** Charlotte Collins, Panagiotis Fytas, İlknur Karadeniz, Huiyuan Zheng, Simon Baker, Ulla Stenius, Anna Korhonen

PMC · DOI: 10.1093/bioinformatics/btag037 · Bioinformatics · 2026-01-21

## TL;DR

BioTriplex is a manually annotated corpus of 100 full-text biomedical articles for fine-tuning language models to extract gene-disease relationships.

## Contribution

The paper introduces BioTriplex, a manually annotated full-text corpus for fine-tuning language models on gene-disease relation extraction.

## Key findings

- A LLaMA 3.1 8B model fine-tuned on BioTriplex outperforms zero-shot and few-shot LLaMA 3.1 approaches in gene-disease relation extraction.
- The fine-tuned model also outperforms zero-shot and few-shot use of the larger GPT-4 and Claude Sonnet 3.7 models on this task.
- The corpus annotates 21 subtypes of gene-disease relationships, supporting classification with broader scope and finer granularity than previously described.

## Abstract

Automatic information extraction from biomedical texts requires machine learning methodology that can recognize biomedical entities, characterize inter-entity relationships, and relate extracted information to specific research topics. Large language models (LLMs) excel in general tasks but perform less reliably in the biomedical domain, where texts are characterized by extensive technical terminology and semantic variations from general literature. There is an unmet need for annotated full-text datasets that can be used to fine-tune language models for significant biomedical applications. Here, we focus on extraction of the complex relationships between genes and diseases.

We present BioTriplex, a corpus of 100 full-length biomedical research articles (comprising 604 subsection texts) manually annotated with disease names, genes, and 21 subtypes of disease–gene relationships. We employ BioTriplex to train the LLaMA 3.1 8B language model in gene–disease relation extraction. Our fine-tuned model outperforms zero-shot and few-shot approaches, both within the LLaMA 3.1 architecture and across the larger state-of-the-art LLMs GPT-4 and Claude Sonnet 3.7, and classifies gene–disease relation types with broader scope and greater granularity than previously described. These results validate BioTriplex as a useful full-text data resource and underscore the value of specialized datasets in fine-tuning language models for important biomedical tasks.
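As a rough illustration of what gene–disease relation extraction produces and how predictions might be compared against manual annotations, the sketch below represents each relation as a (gene, disease, relation type) triple and computes micro-averaged precision, recall, and F1 over exact matches. Everything here is an assumption made for illustration: the `Triple` class, the `score` function, the case-insensitive matching rule, and the `example_relation_*` labels are hypothetical and are not taken from the paper, whose actual 21 relation subtypes and evaluation protocol are described in the full text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One extracted relation: a gene, a disease, and a relation label."""
    gene: str
    disease: str
    relation: str  # hypothetical label; the real 21 subtypes are defined in the paper

    def normalized(self) -> "Triple":
        # Case-insensitive, whitespace-trimmed matching (an assumption, not the paper's rule).
        return Triple(self.gene.strip().lower(),
                      self.disease.strip().lower(),
                      self.relation.strip().lower())


def score(gold: list[Triple], predicted: list[Triple]) -> dict[str, float]:
    """Micro precision, recall, and F1 over exact triple matches."""
    gold_set = {t.normalized() for t in gold}
    pred_set = {t.normalized() for t in predicted}
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative data only; the relation labels are placeholders.
gold = [
    Triple("BRCA1", "breast cancer", "example_relation_a"),
    Triple("TP53", "cancer", "example_relation_b"),
]
predicted = [
    Triple("brca1", "Breast Cancer", "example_relation_a"),  # matches after normalization
    Triple("TP53", "cancer", "example_relation_a"),          # right pair, wrong relation type
]
result = score(gold, predicted)
```

In this toy case the second prediction finds the right gene and disease but assigns the wrong relation type, so it counts as both a false positive and a missed gold triple. Strict matching of this kind is one way to reward fine-grained relation typing; a more lenient scheme could score entity pairs and relation labels separately.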

**Availability:** https://github.com/PanagiotisFytas/BioTriplex

## Full-text entities

- **Diseases:** COVID-19 (MESH:D000086382), influenza (MESH:D007251), disease (MESH:D004194), respiratory diseases (MESH:D012140), physical disorder (MESH:D059445), multiple sclerosis (MESH:D009103), disease of metabolism (MESH:D008659), disease of cellular (MESH:D004806), disease of cellular proliferation (MESH:C565054), disease of mental health (OMIM:603663), hypertension (MESH:D006973), cancer (MESH:D009369), disease by infectious agent (MESH:D003141), anatomical (MESH:D020763), inflammatory bowel disease (MESH:D015212), RE (MESH:D019973), tuberculosis (MESH:D014376), genetic disease (MESH:D030342)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12883087/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12883087/full.md

## References

35 references — full list in the complete paper: https://tomesphere.com/paper/PMC12883087/full.md

---
Source: https://tomesphere.com/paper/PMC12883087