Learning structures of the French clinical language: development and validation of word embedding models using 21 million clinical reports from electronic health records
Basile Dura, Charline Jean, Xavier Tannier, Alice Calliger, Romain Bey, Antoine Neuraz, Rémi Flicoteaux

TL;DR
This study demonstrates that adapting French language models to clinical reports significantly improves NLP task performance, with retraining from scratch offering no clear advantage over fine-tuning.
Contribution
The paper presents the development and validation of specialized French clinical language models using a large corpus, showing improved performance on medical NLP tasks.
Findings
Models trained on clinical reports improved F1-score by 3 percentage points.
Specialized models achieved performance comparable to the original CamemBERT.
Retraining from scratch did not significantly outperform fine-tuning.
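A gain stated in percentage points is an absolute difference between two F1-scores, not a relative improvement. The minimal sketch below illustrates the distinction; the function name `f1_score` and all counts are hypothetical values chosen for illustration and are not results from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall, from raw counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for two models evaluated on the same test set
# (illustrative only, not taken from the paper).
baseline_f1 = f1_score(tp=800, fp=200, fn=200)
adapted_f1 = f1_score(tp=830, fp=170, fn=170)

# Absolute gain, expressed in percentage points.
gain_pp = (adapted_f1 - baseline_f1) * 100

# Relative gain, expressed as a percentage of the baseline score.
gain_relative = (adapted_f1 - baseline_f1) / baseline_f1 * 100

print(f"baseline F1 = {baseline_f1:.3f}, adapted F1 = {adapted_f1:.3f}")
print(f"gain = {gain_pp:.2f} percentage points ({gain_relative:.2f}% relative)")
```

With these made-up counts the absolute gain is 3 percentage points while the relative gain is larger, which is why the two ways of reporting should not be confused.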
Abstract
Background: Clinical studies using real-world data may benefit from exploiting clinical reports, a particularly rich albeit unstructured medium. To that end, natural language processing (NLP) can extract relevant information. Methods based on transfer learning using pre-trained language models have achieved state-of-the-art results in most NLP applications; however, publicly available models lack exposure to speciality languages, especially in the medical field.
Objective: We aimed to evaluate the impact of adapting a language model to French clinical reports on downstream medical NLP tasks.
Methods: We leveraged a corpus of 21M clinical reports collected from August 2017 to July 2021 at the Greater Paris University Hospitals (APHP) to produce two CamemBERT architectures on speciality language: one retrained from scratch and the other using CamemBERT as its initialisation. We used two…
Taxonomy
Topics: Machine Learning in Healthcare · Topic Modeling · Artificial Intelligence in Healthcare and Education
