Integration of AI-generated clinic letters in complex paediatric neurosurgery outpatient settings
Mohamed Elmolla, Anjum Aarifa Khanom, Ahmad M. S. Ali, Christian Duncan, Anusha Hennedige, Vejay N. Vakharia

TL;DR
This study shows that AI-generated clinic letters are more readable than manually dictated ones in pediatric neurosurgery settings, without losing clinical accuracy.
Contribution
Validates AI-generated clinic letters in complex, multi-stakeholder clinical settings for improved readability and accuracy.
Findings
AI-generated letters had significantly better readability scores than clinician-dictated letters.
A blinded clinician rater preferred AI-generated letters in 75% of cases.
AI-generated letters maintained clinical accuracy while improving readability.
Abstract
Dictation of outpatient clinic letters can result in increased workload for clinicians. Generative, natural language processing artificial intelligence (AI) software could supplant dictation and typing, alleviating the clinician’s workload. Therefore, our objective was to validate the use of AI software, Lyrebird AI (Lyrebird Health, Ltd.), to create accurate and readable clinic letters, in the context of a single-clinician general paediatric neurosurgery clinic and a multi-disciplinary craniofacial clinic. Twenty consultations were included, wherein a microphone was used to record the entire consultation. For each consultation, two letters were generated independently: (1) a Lyrebird AI letter automatically generated at the end of the recording and (2) a human clinician's manual dictation in the usual manner. The letters were compared using objective readability metrics…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Surgical Simulation and Training · Electronic Health Records Systems
Introduction
Accurate and comprehensible communication between clinicians and patients is essential. In the outpatient setting, clinic letters are used to summarise discussions and are an important record of the patient’s clinical history, diagnosis, and treatment plan. Presently, dictation, which requires human transcription and checking, is used to create clinic letters. This has been shown to be an inefficient use of a clinician’s time, resulting in increasing stress and job dissatisfaction [1–4]. Additionally, it can also add delay to the letter reaching both the patient and other clinicians involved with the management.
Whilst conventional voice recognition software can be used to mitigate typing, it requires dictation and remains inaccurate due to its inability to process, learn from, and understand input. In comparison, novel generative artificial intelligence (AI) or natural language processing (NLP) tools can be used to automate the generation of clinic letters without the need for any formal dictation simply by passively ‘listening’ to the consultation. This reduces overall consultation times and cognitive burden for the clinician, but it is essential that the text output is clinically accurate without compromising readability and language nuance [5, 6].
With the rapid progression in artificial intelligence (AI) and the development of large language models (LLMs), commercial software is now available that can summarise pertinent information from the consultation without the need for the clinician to dictate a letter. By passively listening in to the consultation, the software can distinguish the healthcare professionals from the patient and family members, extract clinically useful information from the consultation, and summarise the information in a letter format that has been predefined as a template. To determine if such models are practical and accurate enough for use in a complex outpatient setting, we undertook a validation study including both a single clinician general paediatric neurosurgery clinic and a multi-disciplinary craniofacial clinic.
Methods
A prospective case–control study was conducted over three consecutive clinics, using Lyrebird AI (Lyrebird Health, Ltd.), as it is the NLP tool being trialled at our institution.
In this study, the outpatient clinic runs in the typical manner: the clinician sees the patient, who attends with their family member(s). The clinician also opens the Lyrebird application at the beginning of the appointment, which records the consultation via a dictaphone that is placed on the table. Verbal consent was taken at the start of the consultation, prior to starting the recording, in line with the current trust usage policy for Lyrebird. Following this, Lyrebird assimilates the conversation in real time and generates a letter immediately as the appointment concludes. The Lyrebird letters can be edited post hoc but were not edited for the purposes of this study. Separately, and in a blinded manner, the clinician retrospectively dictates their letter in the usual fashion before it is sent for typing. Two letter templates with different, appropriate formatting were set up in advance for the general neurosurgery and craniofacial clinics. All data were anonymised, and all letters were saved in the hospital’s electronic patient record.
For the general neurosurgery clinic, there was one clinician, one patient aged 0–18 years old, and 1–4 family members. In contrast, the MDT clinic contained eight clinicians (comprising a neurosurgeon, plastic surgeon and/or maxillofacial surgeon, speech and language therapist, geneticist, dentist/orthodontist, specialist nurse, and a clinical psychologist), one patient aged 0–15 years old, and 1–2 family members.
The study included twenty consultations. A comparison between the Lyrebird AI–generated and clinician-dictated letters was then undertaken, using four objective metrics of readability (see supplementary materials for further information):
- The Flesch–Kincaid Grade Level (FKGL) uses the total number of syllables, words, and sentences to derive the approximate age of schooling needed to understand a given piece of text [7]. A higher score confers a higher level of schooling required.
- The Flesch Reading Ease (FRE) was used to assess the overall readability of the clinic letters, in which scores are similarly calculated using the total number of syllables, words, and sentences [8]. Scores span a scale of 1–100, with higher scores conferring easier readability.
- The Gunning Fog Index (GFI) uses total words, sentences, and word complexity to assess the overall complexity of text [9]. On a scale of 0–20, higher scores indicate higher complexity.
- The Simple Measure of Gobbledygook (SMOG index) was used to assess the simplicity of a text by considering the polysyllabic word count; higher scores relate to a higher level of schooling required to read the text [10].
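The four metrics above can be sketched in code. This is a minimal illustration using the standard published formulas, not the tool the authors used (the study does not name its readability calculator). The helper `text_counts` and its vowel-group syllable heuristic are assumptions introduced here for illustration; real calculators use more careful syllable and sentence detection, so scores will differ slightly between tools.

```python
import math
import re


def flesch_reading_ease(words, sentences, syllables):
    # Higher score = easier to read.
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    # Higher score = more years of schooling needed.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    # complex_words: words with three or more syllables.
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


def smog_index(sentences, polysyllables):
    # polysyllables: words with three or more syllables,
    # normalised to a 30-sentence sample.
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291


def count_syllables(word):
    # Hypothetical, crude heuristic: one syllable per group of vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_counts(text):
    # Hypothetical helper: derive the raw counts the formulas need.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = [count_syllables(w) for w in words]
    return {
        "words": len(words),
        "sentences": len(sentences),
        "syllables": sum(syllables),
        "polysyllables": sum(1 for s in syllables if s >= 3),
    }


if __name__ == "__main__":
    c = text_counts("No further routine imaging required. Follow up in six months.")
    print(flesch_kincaid_grade(c["words"], c["sentences"], c["syllables"]))
    print(smog_index(c["sentences"], c["polysyllables"]))
```

Keeping the formulas separate from the counting step makes it easy to swap in a better syllable counter without touching the metric definitions.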
A Likert scale (1–5) assessment was also undertaken by one expert, independent, blinded clinician rater, who reviewed the letters for content accuracy and comprehensiveness and recorded an overall subjective preference. Statistical analysis, using the Wilcoxon signed rank test, was completed in R (version 4.4.2); p < 0.05 was deemed statistically significant.
Results
All 20 consultations were analysed by both statistical analysis and human rater assessment. As the data were not normally distributed, the non-parametric Wilcoxon signed rank test and descriptive statistics (median, interquartile range [IQR]) were used.
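As an illustration of the paired analysis described above, the following is a minimal pure-Python sketch of a Wilcoxon signed rank test using the normal approximation, together with a median and IQR helper. The authors ran their analysis in R and do not state which variant of the test (continuity correction, tie correction, handling of zero differences) or which quartile method was used, so the choices below are assumptions and the output need not reproduce the reported Z-values exactly. The function names are hypothetical.

```python
import math
from statistics import median, quantiles


def median_iqr(values):
    # Assumption: "inclusive" quartiles; the paper does not state its method.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q3 - q1


def wilcoxon_signed_rank_z(x, y):
    """Return (W+, z, two-sided p) for paired samples x and y.

    Assumptions: zero differences are dropped, tied absolute differences
    share an average rank, and no tie or continuity correction is applied.
    """
    # Round to avoid spurious floating-point differences between ties.
    diffs = [round(a - b, 6) for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # Sum of ranks for positive differences, then normal approximation.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w_plus, z, p


if __name__ == "__main__":
    # Flesch-Kincaid grade level columns transcribed from Table 1.
    fkgl_ai = [10.0, 8.4, 10.8, 10.5, 7.7, 9.5, 11.7, 7.5, 10.0, 9.4,
               9.6, 10.4, 11.1, 10.3, 10.3, 10.3, 10.3, 10.3, 10.3, 10.3]
    fkgl_clinician = [10.9, 11.4, 11.2, 12.4, 10.2, 11.2, 10.7, 7.7, 9.5, 10.3,
                      11.1, 11.5, 12.2, 8.4, 11.0, 10.1, 11.0, 13.8, 11.5, 11.1]
    print(median_iqr(fkgl_ai), median_iqr(fkgl_clinician))
    # The sign of z depends on the order of the two arguments.
    print(wilcoxon_signed_rank_z(fkgl_clinician, fkgl_ai))
```

With only 20 pairs and several tied values, an exact test or a tie-corrected approximation may give a somewhat different p-value; the sketch is meant to show the mechanics rather than to replace a statistics package.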
Quantitative analysis
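When using the FKGL metric, the Lyrebird AI–generated clinic letters (median 10.3, IQR 0.75) demonstrated higher readability than the clinician-dictated letters (median 10.9, IQR 1.15) (z = 2.73, p < 0.05). Similarly, the SMOG Index demonstrated that the Lyrebird AI–generated letters (median 11.7, IQR 1.25) were more readable than the dictated letters (median 13.1, IQR 0.9) (z = −3.36, p < 0.001) (Table 1). There was no significant difference between the types of letters when assessing them via the GFI (z = −1.22, p = 0.22) and the FRE indices (z = −0.84, p = 0.40) (Table 1).

Table 1. Summary of metrics

| Subject | FKGL: AI | FKGL: Clinician | GFI: AI | GFI: Clinician | SMOG: AI | SMOG: Clinician | FRE: AI | FRE: Clinician | Lyrebird letter score | Clinician letter score | Rater’s preference |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10.0 | 10.9 | 11.8 | 11.7 | 11.6 | 13.1 | 43.3 | 41.7 | 5 | 3 | LB |
| 2 | 8.4 | 11.4 | 11.3 | 12.9 | 11.5 | 13.3 | 54.9 | 45.2 | 5 | 3 | LB |
| 3 | 10.8 | 11.2 | 12.6 | 12.0 | 12.7 | 13.5 | 36.3 | 43.8 | 5 | 4 | LB |
| 4 | 10.5 | 12.4 | 14.5 | 14.0 | 13.7 | 14.6 | 44.8 | 39.2 | 5 | 3 | LB |
| 5 | 7.7 | 10.2 | 8.3 | 10.5 | 10.3 | 12.7 | 57.5 | 49.6 | 5 | 3 | LB |
| 6 | 9.5 | 11.2 | 11.7 | 13.5 | 11.6 | 13.5 | 47.0 | 44.9 | 4 | 4 | Unequivocal |
| 7 | 11.7 | 10.7 | 15.2 | 12.0 | 14.3 | 12.7 | 40.7 | 46.3 | 5 | 3 | LB |
| 8 | 7.5 | 7.7 | 10.0 | 8.9 | 10.7 | 10.9 | 59.2 | 64.4 | 5 | 5 | Unequivocal |
| 9 | 10.0 | 9.5 | 12.5 | 11.0 | 12.6 | 12.9 | 42.0 | 52.8 | 5 | 5 | Unequivocal |
| 10 | 9.4 | 10.3 | 11.1 | 11.7 | 11.6 | 12.4 | 48.0 | 53.7 | 4 | 5 | Clinician |
| 11 | 9.6 | 11.1 | 11.8 | 12.2 | 11.7 | 12.8 | 45.8 | 44.7 | 5 | 4 | LB |
| 12 | 10.4 | 11.5 | 13.9 | 13.9 | 12.8 | 13.6 | 40.4 | 49.4 | 4 | 5 | Clinician |
| 13 | 11.1 | 12.2 | 14.4 | 15.6 | 13.1 | 14.4 | 35.0 | 31.2 | 5 | 4 | LB |
| 14 | 10.3 | 8.4 | 10.9 | 10.5 | 11.3 | 11.2 | 57.5 | 60.3 | 5 | 3 | LB |
| 15 | 10.3 | 11.0 | 13.2 | 13.1 | 12.7 | 13.4 | 37.5 | 42.9 | 5 | 3 | LB |
| 16 | 10.3 | 10.1 | 11.7 | 12.0 | 11.7 | 12.3 | 43.8 | 46.2 | 5 | 2 | LB |
| 17 | 10.3 | 11.0 | 11.0 | 13.0 | 11.3 | 13.2 | 39.8 | 44.6 | 5 | 3 | LB |
| 18 | 10.3 | 13.8 | 13.0 | 15.8 | 12.9 | 15.2 | 35.4 | 34.6 | 5 | 3 | LB |
| 19 | 10.3 | 11.5 | 12.1 | 13.1 | 12.0 | 13.7 | 47.1 | 44.6 | 5 | 2 | LB |
| 20 | 10.3 | 11.1 | 11.1 | 11.7 | 11.4 | 12.3 | 47.9 | 45.0 | 5 | 3 | LB |
| Median | 10.3 | 10.9 | 11.8 | 12.5 | 11.7 | 13.1 | 44.3 | 46.3 | | | |
| IQR | 0.75 | 1.15 | 1.95 | 1.5 | 1.25 | 0.9 | 7.675 | 5.875 | | | |

Wilcoxon signed rank test: FKGL, Z-value = 2.7253, p < 0.05; GFI, Z-value = −1.2274, p = 0.22; SMOG, Z-value = −3.3599, p = 0.00078; FRE, Z-value = −0.84, p = 0.4.

FKGL Flesch–Kincaid grade level, GFI Gunning fog index, SMOG SMOG index, FRE Flesch reading ease, IQR interquartile range, LB Lyrebird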
Qualitative analysis
In the human rater assessment, the Lyrebird AI–generated letters were preferred in 75% (n = 15) of cases and the clinician-dictated letters in two cases; for the remaining three letters, the rater gave both versions the same score and expressed no preference (recorded as “Unequivocal” in Table 1). Both Lyrebird AI–generated and clinician-dictated letters held the same content. Importantly, Lyrebird AI–generated letters did not detail any information that was deemed unsafe or incorrect.
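The preference split can be checked directly against the rater’s preference column of Table 1. The short sketch below tallies that column; the list is transcribed from Table 1 in subject order, and the variable names are illustrative only.

```python
from collections import Counter

# Rater's preference for subjects 1-20, transcribed from Table 1
# (labels as used in the table; LB = Lyrebird).
preferences = [
    "LB", "LB", "LB", "LB", "LB", "Unequivocal", "LB", "Unequivocal",
    "Unequivocal", "Clinician", "LB", "Clinician", "LB", "LB", "LB",
    "LB", "LB", "LB", "LB", "LB",
]

tally = Counter(preferences)
# Percentage of all letters falling into each preference category.
share = {label: 100 * count / len(preferences) for label, count in tally.items()}

if __name__ == "__main__":
    for label, count in tally.most_common():
        print(f"{label}: {count} of {len(preferences)} ({share[label]:.0f}%)")
```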
With regard to content, there were three primary differences between the Lyrebird AI–generated and clinician-dictated letters.
Firstly, when a clinical examination of the patient was conducted (11 out of 20 consultations), Lyrebird AI–generated letters consistently offered details of the entire examination including negative findings, whereas clinicians dictated only the salient points.
Secondly, all letters formatted a summary of the management plan at the top of the clinic letter, outside of the main body of text. However, this plan was comprehensive in all 20 Lyrebird AI–generated letters, informing patients and relatives of aspects that required no further action, such as “no further routine imaging required”. In most cases (14 out of 20 letters), clinicians only dictated aspects of management plans that would result in further action.
Finally, all letters also formatted a summary of the relevant diagnoses at the top of the clinic letter, outside of the main body of text. In six cases, clinician-dictated letters were more nuanced than the Lyrebird AI letters, containing details that helped allude to the patient’s clinical course, for example: detailing relevant past surgical history; stating “incidental finding of posterior fossa arachnoid cyst” as opposed to “posterior fossa arachnoid cyst”; and describing “generalised cerebral atrophy” as opposed to “cerebral atrophy”.
Discussion
Present study and literature review
In this prospective case–control study, we examined the readability of 20 Lyrebird AI–generated versus 20 consultant-dictated clinic letters. Lyrebird AI was found to significantly improve the readability of clinic letters across 2 of 4 metrics, whilst maintaining clinical accuracy. There was a non-significant difference for the remaining metrics. In addition to objective measures, an independent rater subjectively described an overall preference for the Lyrebird AI–generated letters by a wide margin.
To our knowledge, this is the first paper to directly compare AI-generated and clinician-dictated clinic letters for real patients, wherein the AI software is purpose-built for healthcare documentation. Lyrebird AI significantly improved the readability of clinic letters across 2 of 4 metrics, whilst maintaining accuracy. This indicates that the Lyrebird AI–generated letters are able to convey the same complexity of information at a lower reading age. In addition, given the complex nature of specialty neurosurgery and craniofacial clinics, there is a certain amount of medical nomenclature, such as genetic diagnoses, physical examination and operative procedures, that cannot be simplified further, which may explain why differences were only found in two of the four metrics.
Our findings demonstrate that generative AI could be employed to reduce the administrative burden on clinicians. Perhaps one of the most powerful applications of Lyrebird AI is that it can contend with MDT settings to assimilate multiple specialist conversation lines into a concise summary that remains readable to non-healthcare professionals.
When considering these findings, it is important to refrain from generalising all generative AI software into a single entity [11, 12]. The majority of literature has so far been published on the use of the most widely available free chatbot, ChatGPT from OpenAI (Table 2) [1, 5, 6, 13–17]. For example, a prospective cohort study carried out in 2024 found that ChatGPT was able to increase readability and maintain the accuracy of orthopaedic surgery outpatient clinic letters [13]. However, like other studies, the authors found the chatbot lacked the same language nuance as clinician-dictated letters, which resulted in sentences that lacked tact and were superfluous [13, 16, 17]. Whilst the current study demonstrates that clinicians do provide more technical nuance in some aspects, particularly in the summary of diagnoses, Lyrebird AI–generated letters generally held the same meaning and did not omit any clinical information that might change the clinical impression or management plan of the patient. This difference may be because Lyrebird AI is a generative AI software that is purpose-built for healthcare documentation.

Table 2. Summary of studies to date comparing the qualities of AI software–generated to clinician-generated outpatient letters

| Author | Year | Study type | Sample size (n) | Setting | AI software | Comparative metrics | Results |
|---|---|---|---|---|---|---|---|
| Elmolla et al. (current study) | 2024 | Prospective case control | 20 | Real-world; paediatric craniofacial subspecialist outpatients and MDTs | Lyrebird (Lyrebird Health, Ltd.) | Flesch–Kincaid grade level | Significant improvement with AI |
| | | | | | | SMOG index | Significant improvement with AI |
| | | | | | | Gunning fog index | No significant difference |
| | | | | | | Flesch reading ease | No significant difference |
| | | | | | | Blinded human rater: subjective preference of letter | No significant difference |
| Bass et al. | April 2025 | Prospective comparison study | 4 | Simulated; paediatric ENT same day post-surgical discharge | ChatGPT 3.5 (OpenAI, Ltd.) | Blinded human rater: quality of medical information | Significant improvement with AI |
| | | | | | | Blinded human rater: ease of reading | Significant improvement with AI |
| | | | | | | Blinded human rater: length of letter | No significant difference |
| Balloch et al. [19] | June 2024 | Simulated interventional study | 47 | Simulated; adult multi-specialty outpatients | Tortus (Tortus AI, Ltd.); ChatGPT 4.0 (OpenAI, Ltd.) | Sheffield assessment instrument for letters | Significant improvement with AI |
| | | | | | | Total timing of consultation | Significantly shorter consultation with AI |
| | | | | | | Total time speaking to patients | No significant difference |
| | | | | | | NASA task load index | Significant improvement with AI |
| | | | | | | Blinded human rater assessment: overall experience | Significant improvement with AI |
| Stoneham et al. [13] | April 2024 | Prospective cohort study | 70 | Real-world; adult orthopaedic subspecialist outpatients | ChatGPT 3.5 (OpenAI, Ltd.) | Mean word count | Significant increase with AI |
| | | | | | | Flesch reading ease | Significant worsening with AI |
| | | | | | | Flesch–Kincaid grade level | Significant worsening with AI |
| | | | | | ChatGPT 4.0 (OpenAI, Ltd.) | Mean word count | Significant increase with AI |
| | | | | | | Flesch reading ease | No significant difference |
| | | | | | | Flesch–Kincaid grade level | No significant difference |
Future work
The potential of Lyrebird AI could be realised further by exploring its text output in a context-dependent manner [6]. It has potential for use in settings in which patients are not participants—for example, to summarise decisions made in inter-specialty multi-disciplinary meetings. Additionally, Lyrebird AI has the function to create several custom formatting templates that the clinician can preset. Individual specialties, and indeed individual clinicians, will have preference for clinic letter formatting in order to highlight essential information. The impact of template refinement should be investigated as a further potential time-saving avenue, and for enhancing user experience through individualising the software for each healthcare setting.
These considerations ultimately point to future work on economic analysis of generative AI use in the clinical setting. The market for AI tools with healthcare applications is large and growing exponentially [18]. By nature of such a market, each individual product will require its own cost-efficacy validation, which remains a barrier, but is essential for wide-scale uptake. It remains imperative that such tools are rigorously validated, and that clinical letters are checked for factual content and accuracy by clinicians, which is no different from human-generated letters.
Limitations
The limitations of this work include: (1) the objective metrics used are generalised measures of readability, as healthcare-specific readability metrics do not currently exist to our knowledge; (2) the findings are only applicable to one type of commercially available generative AI software and are therefore not generalisable; and (3) the clinical context is a highly specialised neurosurgical and craniofacial clinic, which again means the findings may not be generalisable to other settings, such as general practice or the emergency department. These limitations highlight the potential focus of future work. Namely, if generative AI is being used to enhance clinician effort, then analysis of overall time and effort saving is essential to further quantify its efficiency as a tool. Balloch et al. [19] have successfully demonstrated this in simulation, with a view to expand to real-world clinical environments.
Conclusions
Through the use of both objective readability metrics and subjective blinded clinician ratings, we provide validation of the Lyrebird AI generative software for use in a highly specialist single-clinician general neurosurgery clinic and multi-disciplinary craniofacial clinic environment. Unlike previously described technologies, it does not require the burden of a carefully considered prompt, nor does it require redaction of sensitive patient information. Rather, it is a single software that ambiently listens to clinicians, patients, and families to immediately scribe a readable and accurate letter which can be reviewed and sent without delay. Future work should focus on time, effort, and cost-saving analysis, in addition to template refinement and applications in other clinical settings.
Supplementary Information
Below is the link to the electronic supplementary material. Supplementary file 1 (DOCX 19 kb)
References
- 1. Scott B (2025) The Gunning fog index (or FOG) readability formula. https://readabilityformulas.com/the-gunnings-fog-index-or-fog-readability-formula/. Accessed 2025
