Almost Clinical: Linguistic properties of synthetic electronic health records

Serge Sharoff; John Baker; David Francis Hunt; Alan Simpson

arXiv:2601.01171 · cs.CL · February 5, 2026

TL;DR

This paper assesses the linguistic and clinical quality of synthetic electronic health records generated by large language models, highlighting their potential for research and current limitations in clinical accuracy and specificity.

Contribution

It provides a detailed analysis of how LLMs construct clinical language and identifies key divergences from real records, advancing understanding of synthetic health data.

Findings

01. LLMs produce coherent, terminology-appropriate clinical texts

02. Systematic divergences include register shifts and clinical inaccuracies

03. Synthetic records enable large-scale linguistic research

Abstract

This study evaluates the linguistic and clinical suitability of synthetic electronic health records in mental health. First, we describe the rationale and the methodology for creating the synthetic corpus. Second, we examine expressions of agency, modality, and information flow across four clinical genres (Assessments, Correspondence, Referrals, and Care plans), with the aim of understanding how LLMs grammatically construct medical authority and patient agency through linguistic choices. While LLMs produce coherent, terminology-appropriate texts that approximate clinical practice, systematic divergences remain, including registerial shifts, insufficient clinical specificity, and inaccuracies in medication use and diagnostic procedures. The results show both the potential and limitations of synthetic corpora for enabling large-scale linguistic research otherwise impossible with genuine…

Taxonomy

Topics: Electronic Health Records Systems · Machine Learning in Healthcare · Interpreting and Communication in Healthcare