Almost Clinical: Linguistic properties of synthetic electronic health records
Serge Sharoff, John Baker, David Francis Hunt, Alan Simpson

TL;DR
This paper assesses the linguistic and clinical quality of synthetic electronic health records generated by large language models, highlighting their potential for research and current limitations in clinical accuracy and specificity.
Contribution
It provides a detailed analysis of how LLMs construct clinical language and identifies key divergences from real records, advancing understanding of synthetic health data.
Findings
LLMs produce coherent, terminology-appropriate clinical texts
Systematic divergences include register shifts and clinical inaccuracies
Synthetic records enable large-scale linguistic research
Abstract
This study evaluates the linguistic and clinical suitability of synthetic electronic health records in mental health. First, we describe the rationale and the methodology for creating the synthetic corpus. Second, we examine expressions of agency, modality, and information flow across four clinical genres (Assessments, Correspondence, Referrals and Care plans), with the aim of understanding how LLMs grammatically construct medical authority and patient agency through linguistic choices. While LLMs produce coherent, terminology-appropriate texts that approximate clinical practice, systematic divergences remain, including registerial shifts, insufficient clinical specificity, and inaccuracies in medication use and diagnostic procedures. The results show both the potential and limitations of synthetic corpora for enabling large-scale linguistic research otherwise impossible with genuine…
Taxonomy
Topics: Electronic Health Records Systems · Machine Learning in Healthcare · Interpreting and Communication in Healthcare
