Towards a Method for Synthetic Generation of Persons with Aphasia Transcripts

Jason M. Pittman; Anton Phillips Jr.; Yesenia Medina-Santos; Brielle C. Stark

arXiv:2510.24817·cs.CL·October 31, 2025

TL;DR

This paper explores methods to generate synthetic aphasic speech transcripts using procedural programming and large language models, aiming to address data scarcity in aphasia research and improve automated recognition systems.

Contribution

The paper introduces two novel methods for synthetic transcript generation, one procedural and one LLM-based, and validates them across severity levels to better model aphasic speech patterns.

Findings

01

Mistral 7b Instruct best captures linguistic degradation

02

Synthetic transcripts show realistic changes in NDW (number of different words), word count, and length

03

Methods can aid in creating larger datasets for aphasia research
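
The measures named in the findings can be made concrete with a small sketch. It assumes NDW means the number of different words, the usual reading in language sample analysis, and it uses a simple lowercase regex tokenizer. The function names and the sample sentence are hypothetical illustrations, not taken from the paper or from AphasiaBank, and the authors' own tooling may count words differently.

```python
import re


def tokenize(transcript: str) -> list[str]:
    """Lowercase the text and keep runs of letters and apostrophes.

    Assumption: this tokenizer is illustrative only; the paper's
    actual word-counting rules are not given in the summary.
    """
    return re.findall(r"[a-z']+", transcript.lower())


def transcript_measures(transcript: str) -> dict[str, int]:
    """Return word count and NDW (number of different words)."""
    tokens = tokenize(transcript)
    return {
        "word_count": len(tokens),   # every token, repeats included
        "ndw": len(set(tokens)),     # distinct tokens only
    }


# Invented sample sentence with repetition, not real transcript data.
sample = "the cat the cat is in the tree"
measures = transcript_measures(sample)
print(measures)
```

Comparing these values across transcripts generated at different severity levels is one way to check whether a generator produces the kind of change the findings describe.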

Abstract

In aphasia research, Speech-Language Pathologists (SLPs) devote extensive time to manually coding speech samples using Correct Information Units (CIUs), a measure of how informative an individual sample of speech is. Developing automated systems to recognize aphasic language is limited by data scarcity. For example, only about 600 transcripts are available in AphasiaBank, yet billions of tokens are used to train large language models (LLMs). In the broader field of machine learning (ML), researchers increasingly turn to synthetic data when real data are sparse. Therefore, this study constructs and validates two methods to generate synthetic transcripts of the AphasiaBank Cat Rescue picture description task. One method uses a procedural programming approach, while the second uses the Mistral 7b Instruct and Llama 3.1 8b Instruct LLMs. The methods generate transcripts across four severity…
