Clinical Document Corpora -- Real Ones, Translated and Synthetic Substitutes, and Assorted Domain Proxies: A Survey of Diversity in Corpus Design, with Focus on German Text Data
Udo Hahn

TL;DR
This survey examines diverse corpus designs for German clinical text data, highlighting the scarcity of accessible real corpora and the use of proxies such as translated or synthetic datasets, which raises questions about their validity.
Contribution
It provides a comprehensive overview of existing German clinical corpora and proxies, analyzing their diversity and highlighting the challenges in data accessibility and validity.
Findings
The majority of German clinical corpora are inaccessible, with proxies used as substitutes.
The survey identifies 92 corpus versions, including 46 real, 5 translated, and 6 synthetic corpora.
Proxies vary in closeness to real data, affecting their validity for research.
Abstract
We survey clinical document corpora, with focus on German textual data. Due to rigid data privacy legislation in Germany, these resources, with only a few exceptions, are stored in safe clinical data spaces and locked against clinic-external researchers. This situation stands in stark contrast with established workflows in the field of natural language processing, where easy accessibility and reuse of data collections are common practice. Hence, alternative corpus designs have been examined to escape from this data poverty. Besides machine translation of English clinical datasets and the generation of synthetic corpora with fictitious clinical contents, several other types of domain proxies have come up as substitutes for clinical documents. Common instances of close proxies are medical journal publications, therapy guidelines, drug labels, etc.; more distant proxies include online…
Taxonomy
Topics: Natural Language Processing Techniques · Interpreting and Communication in Healthcare · linguistics and terminology studies
Methods: Sparse Evolutionary Training · Focus
