Language corpora for the Dutch medical domain

B. van Es

arXiv:2604.25374·cs.CL·April 29, 2026

4 Models 4 Datasets

TL;DR

This paper introduces the first large-scale Dutch medical language corpus: approximately 35 billion tokens in about 100 million documents, intended for pre-training and downstream NLP tasks in the medical domain.

Contribution

The work builds and publicly releases a Dutch medical corpus from three sources: English datasets translated into Dutch, medical text identified in generic corpora, and open Dutch medical resources.
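The summary does not say how medical text was identified in generic corpora. Purely as an illustration of what such a step can look like, the sketch below scores a document by the share of its tokens that appear in a small seed vocabulary. The term list, the function names `medical_term_density` and `looks_medical`, and the 5% threshold are all hypothetical choices made for this example, not the author's method.

```python
import re
import unicodedata

# Hypothetical seed vocabulary of Dutch medical terms (an assumption for
# illustration only; the paper's actual selection method is not described here).
MEDICAL_TERMS = {
    "patiënt", "diagnose", "behandeling", "symptomen",
    "medicatie", "ziekenhuis", "huisarts", "bloeddruk",
}

TOKEN_RE = re.compile(r"\w+")


def medical_term_density(text: str) -> float:
    """Return the fraction of tokens that occur in MEDICAL_TERMS."""
    # Normalise to NFC so that characters such as "ë" are single code points
    # and match the entries in the seed vocabulary.
    normalised = unicodedata.normalize("NFC", text)
    tokens = [t.lower() for t in TOKEN_RE.findall(normalised)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in MEDICAL_TERMS)
    return hits / len(tokens)


def looks_medical(text: str, threshold: float = 0.05) -> bool:
    """Flag a document as medical if its term density reaches the threshold.

    The default threshold is an arbitrary illustrative value.
    """
    return medical_term_density(text) >= threshold


if __name__ == "__main__":
    docs = [
        "De patiënt kreeg medicatie na de diagnose.",
        "Het weer is vandaag zonnig en warm.",
    ]
    for doc in docs:
        print(looks_medical(doc), doc)
```

A keyword heuristic like this is cheap enough to run over very large generic corpora, but it misses medical text that uses vocabulary outside the seed list; a trained classifier would be the usual next step.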

Findings

01 The corpus contains approximately 35 billion tokens across about 100 million documents.

02 The corpus is freely available on Hugging Face.

03 The work establishes a foundational resource for Dutch medical NLP.
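For a sense of scale, the two headline figures imply an average document length. The short calculation below uses only the approximate counts quoted above; the function name `mean_tokens_per_document` is a label chosen for this example, and the result is a rough mean that says nothing about how document lengths are distributed.

```python
# Approximate figures as quoted in the summary.
TOTAL_TOKENS = 35_000_000_000
TOTAL_DOCUMENTS = 100_000_000


def mean_tokens_per_document(tokens: int, documents: int) -> float:
    """Return the average number of tokens per document."""
    return tokens / documents


if __name__ == "__main__":
    avg = mean_tokens_per_document(TOTAL_TOKENS, TOTAL_DOCUMENTS)
    print(f"about {avg:.0f} tokens per document on average")
```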

Abstract

Background: Dutch medical corpora are scarce, which limits NLP development.

Methods: We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources.

Results: The resulting corpus comprises approximately 35 billion tokens across the medical domain in about 100 million documents and is freely available on Hugging Face.

Conclusion: This work establishes the first large-scale Dutch medical language corpus for pre-training and downstream NLP tasks.
