Language corpora for the Dutch medical domain
B. van Es

TL;DR
This paper introduces the first large-scale Dutch medical language corpus, consisting of approximately 35 billion tokens from 100 million documents, to advance NLP in the medical domain.
Contribution
It builds and publicly releases a Dutch medical corpus by translating English datasets, identifying medical text in generic corpora, and extracting open Dutch medical resources, providing data for pre-training and downstream NLP tasks.
Findings
The corpus contains roughly 35 billion tokens across about 100 million documents.
The corpus is freely available on Hugging Face.
It establishes a foundational resource for Dutch medical NLP.
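As a quick sanity check on the reported scale, the two headline figures imply an average document length. The sketch below only divides the rounded numbers quoted above; the variable names are illustrative, and the true per-document distribution is not given in the source.

```python
# Rounded figures as reported in the summary (both are approximate).
total_tokens = 35_000_000_000      # ~35 billion tokens
total_documents = 100_000_000      # ~100 million documents

# Mean tokens per document implied by the rounded totals.
avg_tokens_per_document = total_tokens / total_documents
print(avg_tokens_per_document)  # 350.0
```

An average of roughly 350 tokens per document suggests that the corpus is dominated by fairly short texts, although the mean alone says nothing about how lengths are spread.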
Abstract
Background: Dutch medical corpora are scarce, limiting NLP development.
Methods: We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources.
Results: The resulting corpus comprises 35 billion tokens across the medical domain in about 100 million documents, freely available on Hugging Face.
Conclusion: This work establishes the first large-scale Dutch medical language corpus for pre-training and downstream NLP tasks.
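The abstract does not say how medical text was identified in generic corpora. Purely as an illustration of what such a filtering step could look like, here is a minimal keyword-ratio sketch. The seed vocabulary, the threshold, and the function names are all hypothetical and are not taken from the paper; a real pipeline would more likely use a trained classifier.

```python
import re

# Hypothetical seed vocabulary of Dutch medical terms (an assumption,
# not the paper's selection criteria).
MEDICAL_TERMS = {
    "patiënt", "diagnose", "behandeling",
    "medicatie", "symptomen", "ziekenhuis",
}

TOKEN_RE = re.compile(r"\w+")


def medical_term_ratio(text: str) -> float:
    """Return the fraction of tokens that appear in the seed vocabulary."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in MEDICAL_TERMS)
    return hits / len(tokens)


def looks_medical(text: str, threshold: float = 0.05) -> bool:
    """Flag a document when its medical-term ratio reaches the threshold.

    The default threshold is an arbitrary illustrative value.
    """
    return medical_term_ratio(text) >= threshold
```

For example, `looks_medical("Na de diagnose volgde een behandeling in het ziekenhuis.")` is flagged because three of its nine tokens are in the seed list, while a sentence about the weather is not. A fixed word list like this trades recall for simplicity, so it is best read as a starting point rather than a description of the released corpus.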
