SynGP500: A Clinically-Grounded Synthetic Dataset of Australian General Practice Medical Notes
Piyawoot Songsiritat

TL;DR
SynGP500 is a synthetic dataset of 500 Australian general practice notes designed to improve clinical NLP models by reflecting real-world complexity, diverse conditions, and epidemiological accuracy.
Contribution
The paper introduces a diverse, realistic synthetic dataset of 500 Australian general practice notes, addressing data scarcity and privacy concerns in clinical NLP research.
Findings
The dataset aligns with epidemiological patterns from the BEACH study.
Analysis confirms high linguistic and semantic diversity.
The dataset improves medical concept extraction performance in downstream tasks.
Abstract
We introduce SynGP500, a clinician-curated collection of 500 synthetic Australian general practice medical notes. The dataset integrates curriculum-based clinical breadth (RACGP 2022 Curriculum), epidemiologically-calibrated prevalence (BEACH study), and diverse consultation contexts. This approach systematically includes both common presentations and less-common curriculum-specified conditions that GPs must recognize but appear infrequently in single practice populations, potentially supporting more generalizable model training than datasets constrained by naturally occurring case distributions. SynGP500 is messy by design, reflecting the authentic complexity of healthcare delivery: telegraphic documentation, typos, patient non-adherence, socioeconomic barriers, and clinician-patient disagreements, unlike sanitized synthetic datasets that obscure clinical realities. Multi-faceted…
Taxonomy
Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Data-Driven Disease Surveillance
