PARHAF, a human-authored corpus of clinical reports for fictitious patients in French
Xavier Tannier, Salam Abbara, Rémi Flicoteaux, Youness Khalil, Aurélie Névéol, Pierre Zweigenbaum, Emmanuel Bacry

TL;DR
PARHAF is a large, open-source corpus of French clinical reports written by clinical experts about entirely fictitious patients. Because it contains no real patient data, it supports privacy-preserving NLP research and benchmarking in healthcare.
Contribution
It introduces a structured methodology for building a comprehensive, shareable corpus of human-authored French clinical reports for fictitious patients, covering multiple specialties and supporting several NLP tasks.
Findings
Contains 7,394 reports for 5,009 patients across various specialties.
Supports multiple NLP use cases including information extraction and diagnostic coding.
Provides a privacy-preserving resource for training and evaluating clinical language models.
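The two counts above imply that some patients have more than one report. A minimal sketch of that arithmetic follows; the variable names are illustrative, and the lower bound assumes every patient has at least one report, which this summary does not state explicitly.

```python
# Counts taken from the Findings list above.
reports = 7394
patients = 5009

# Average number of reports per fictitious patient.
reports_per_patient = reports / patients

# Assumption: every patient has at least one report. Under that assumption,
# this many reports are a second or later document for some patient.
additional_reports = reports - patients

print(f"Average reports per patient: {reports_per_patient:.2f}")
print(f"Reports beyond the first per patient: {additional_reports}")
```

This summary does not describe how reports are distributed across patients, so the sketch gives only the average and the surplus.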
Abstract
The development of clinical natural language processing (NLP) systems is severely hampered by the sensitive nature of medical records, which restricts data sharing under stringent privacy regulations, particularly in France and the broader European Union. To address this gap, we introduce PARHAF, a large open-source corpus of clinical documents in French. PARHAF comprises expert-authored clinical reports describing realistic yet entirely fictitious patient cases, making it anonymous and freely shareable by design. The corpus was developed using a structured protocol that combined clinician expertise with epidemiological guidance from the French National Health Data System (SNDS), ensuring broad clinical coverage. A total of 104 medical residents across 18 specialties authored and peer-reviewed the reports following predefined clinical scenarios and document templates. The corpus…
Taxonomy
Topics: Topic Modeling · Machine Learning in Healthcare · Data-Driven Disease Surveillance
