Towards a Method for Synthetic Generation of Persons with Aphasia Transcripts
Jason M. Pittman, Anton Phillips Jr., Yesenia Medina-Santos, Brielle C. Stark

TL;DR
This paper explores methods to generate synthetic aphasic speech transcripts using procedural programming and large language models, aiming to address data scarcity in aphasia research and improve automated recognition systems.
Contribution
It introduces two novel methods for synthetic transcript generation, leveraging LLMs and procedural techniques, validated across severity levels to better model aphasic speech patterns.
Findings
Mistral 7b Instruct best captures linguistic degradation
Synthetic transcripts show realistic changes in number of different words (NDW), word count, and length
Methods can aid in creating larger datasets for aphasia research
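The findings above compare transcripts on simple lexical measures. As a minimal sketch of how two of them, word count and the number of different words (NDW), might be computed for a single transcript, consider the following. The tokenizer (lowercasing and keeping runs of letters and apostrophes) and the function names are assumptions for illustration, not the paper's preprocessing.

```python
import re


def tokenize(transcript: str) -> list[str]:
    """Lowercase the transcript and split it into word tokens.

    Assumption: a token is a run of letters or apostrophes. Real transcript
    formats carry extra annotation codes that this sketch does not handle.
    """
    return re.findall(r"[a-z']+", transcript.lower())


def word_count(transcript: str) -> int:
    """Total number of word tokens in the transcript."""
    return len(tokenize(transcript))


def ndw(transcript: str) -> int:
    """Number of different words: the count of unique word tokens."""
    return len(set(tokenize(transcript)))


sample = "the cat is in the tree and the man is in the tree"
print(word_count(sample), ndw(sample))
```

A transcript with heavy repetition has an NDW well below its word count, so the two measures together give a rough picture of lexical diversity.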
Abstract
In aphasia research, Speech-Language Pathologists (SLPs) devote extensive time to manually coding speech samples using Correct Information Units (CIUs), a measure of how informative an individual sample of speech is. Developing automated systems to recognize aphasic language is limited by data scarcity. For example, only about 600 transcripts are available in AphasiaBank, yet billions of tokens are used to train large language models (LLMs). In the broader field of machine learning (ML), researchers increasingly turn to synthetic data when real data are sparse. Therefore, this study constructs and validates two methods to generate synthetic transcripts of the AphasiaBank Cat Rescue picture description task. One method leverages a procedural programming approach while the second uses Mistral 7b Instruct and Llama 3.1 8b Instruct LLMs. The methods generate transcripts across four severity…
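To make the idea of a procedural approach concrete, here is a toy sketch that degrades a fluent sentence by randomly omitting words and inserting fillers. It is not the authors' method: the `degrade` function, the `drop_prob` and `filler_prob` parameters, and the filler list are all hypothetical, and the abstract does not say which transformations the paper applies or how they map to severity.

```python
import random

# Hypothetical filler tokens; not taken from the paper.
FILLERS = ["um", "uh"]


def degrade(text: str, drop_prob: float, filler_prob: float = 0.0, seed: int = 0) -> str:
    """Return a degraded copy of `text`.

    Each word is omitted with probability `drop_prob`, and a filler is
    inserted before each word position with probability `filler_prob`.
    A fixed seed keeps the output reproducible.
    """
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if rng.random() < filler_prob:
            out.append(rng.choice(FILLERS))
        if rng.random() >= drop_prob:
            out.append(word)
    return " ".join(out)


fluent = "the man climbed the tree to rescue the cat"
for p in (0.0, 0.3, 0.6):
    # A higher drop probability stands in for greater severity (an assumption).
    print(p, "->", degrade(fluent, drop_prob=p, filler_prob=0.2, seed=1))
```

Raising `drop_prob` tends to lower both word count and the number of different words, which is the kind of directional change the findings describe.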
