# Generative AI Accelerates Genotype–Phenotype Characterization of a 1600-Case Leigh Syndrome Virtual Cohort from Published Literature

**Authors:** Lishuang Shen

PMC · DOI: 10.3390/biology15040334 · Biology · 2026-02-14

## TL;DR

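A GenAI workflow with human-in-the-loop curation, built on Google's Gemini-2.5-pro, processed over 2300 published cases in two weeks and produced a harmonized virtual cohort of 1679 Leigh Syndrome Spectrum cases, enabling cohort-scale analysis of genetic, phenotypic, and survival patterns.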

## Contribution

A novel workflow that combines GenAI with human-in-the-loop curation to build large, analysis-ready virtual cohorts for rare diseases from unstructured case-level data in the literature.

## Key findings

- Autosomal recessive (932 cases) and mitochondrial (752 cases) inheritance are the most common modes of inheritance in LSS.
- Defects in mitochondrial translation were associated with the poorest prognosis (84% mortality in that group) and early onset (0.23 years).
- Among deceased patients, Complex V mutations were linked to a significantly shorter mean survival time (1.77 years) than Complex I (3.70 years) or Complex IV (3.57 years) mutations.
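The survival comparison can be checked with simple arithmetic. The sketch below uses only the mean survival times among deceased patients reported in the abstract (1.77, 3.70, and 3.57 years); the variable and function names are illustrative, and the code makes no claim about the statistical test the authors used.

```python
# Mean survival (years) among deceased patients, as reported in the abstract.
mean_survival_years = {
    "Complex I": 3.70,
    "Complex IV": 3.57,
    "Complex V": 1.77,
}


def ratio_of_complex_v(means):
    """Return Complex V mean survival as a fraction of each other complex's mean."""
    complex_v = means["Complex V"]
    return {
        name: round(complex_v / value, 2)
        for name, value in means.items()
        if name != "Complex V"
    }


ratios = ratio_of_complex_v(mean_survival_years)
print(ratios)
```

Both ratios come out close to one half, which is consistent with the plain-language statement that Complex V patients survived about half as long.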

## Abstract

Leigh Syndrome Spectrum (LSS) is a very rare, severe brain disorder. Research on it is hampered by the small number of patients per report and by the cost of manually merging and standardizing medical data across many small reports. This study built a generative artificial intelligence (GenAI) system based on Google’s Gemini-2.5-pro, supervised by medical experts. In just two weeks it processed the published data of over 2300 patients, converting unstructured medical information into standardized clinical data dictionaries with over 90% accuracy. The result is one of the largest LSS virtual datasets, with 1679 curated cases. Analysis of this dataset showed that most patients had either recessively inherited mutations or mitochondrial DNA mutations. The most frequently affected genes were SURF1, MT-ATP6, and MT-ND3. Key symptoms include lactic acidosis, muscle weakness, brain lesions, and mitochondrial dysfunction. Patients with defects in mitochondrial protein production had the worst survival outcomes, with 84% mortality. Among patients who died, those with Complex V gene mutations survived an average of 1.77 years, about half as long as those with Complex I or IV mutations. This AI-powered approach provides a scalable way to create large virtual patient cohorts from the published literature, accelerating research and discovery for Leigh Syndrome and other rare diseases.

Leigh Syndrome Spectrum (LSS) is a rare and heterogeneous disease continuum, and most published cohorts are too small to provide adequate statistical power. Large-scale meta-analyses of case-level clinical data extracted from the literature are essential for robust population analysis, but they are hindered by the burden of manually standardizing unstructured, heterogeneous, and sparse case-level data. We developed a novel workflow, among the first to combine Generative AI (GenAI) with human-in-the-loop curation, to overcome this barrier. The pipeline used Google’s Gemini-2.5-pro to process over 2300 cases from published case data tables in two weeks and achieved >90% accuracy in mapping raw clinical data to Human Phenotype Ontology (HPO) terms. It yielded a harmonized LSS virtual cohort of 1679 data-rich cases, the largest LSS virtual cohort reported so far, which enabled characterization of the phenotypic and genetic architecture of LSS. Autosomal recessive (932 cases) and mitochondrial (752 cases) inheritance were the most common. The most frequently mutated genes were SURF1 (240 cases), MT-ATP6 (199), and MT-ND3 (183). HPO term consolidation identified common hallmark phenotypes, including lactic acidosis, hypotonia, bilateral basal ganglia lesions, and mitochondrial respiratory chain deficiency. The cohort’s scale enabled large-scale survival analysis, which showed that defects in mitochondrial translation are associated with the poorest prognosis (84% mortality in this group) and early onset (0.23 years). Among deceased patients, Complex V mutations were linked to a significantly shorter mean survival time (1.77 years) than Complex I (3.70 years) or Complex IV (3.57 years) mutations. This GenAI-driven methodology establishes a scalable framework for rapidly creating analysis-ready virtual cohorts from heterogeneous literature and for accelerating population-level studies of rare diseases, including Leigh Syndrome and other mitochondrial diseases.
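To make the human-in-the-loop step concrete, here is a minimal sketch of how a curator-reviewed batch of model-proposed HPO mappings could be scored. It is not the authors' pipeline: the `Mapping` class, the `mapping_accuracy` function, and the example rows are all hypothetical, no call to Gemini is made, and the HPO identifiers are shown for illustration and should be checked against the current HPO release.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mapping:
    raw_text: str                      # free-text phenotype from a published case table
    proposed_hpo: str                  # term ID proposed by the model (hypothetical output)
    curator_hpo: Optional[str] = None  # term ID confirmed by a curator; None = not yet reviewed


def mapping_accuracy(mappings: List[Mapping]) -> float:
    """Fraction of curator-reviewed mappings where the proposal matched the curator."""
    reviewed = [m for m in mappings if m.curator_hpo is not None]
    if not reviewed:
        raise ValueError("no reviewed mappings to score")
    correct = sum(m.proposed_hpo == m.curator_hpo for m in reviewed)
    return correct / len(reviewed)


# Hypothetical batch; rows are invented for illustration only.
batch = [
    Mapping("elevated blood lactate with acidosis", "HP:0003128", "HP:0003128"),
    Mapping("floppy infant", "HP:0001252", "HP:0001252"),
    # Curator corrects a wrong proposal (Seizure) to Ataxia.
    Mapping("unsteady gait", "HP:0001250", "HP:0001251"),
    # Not yet reviewed, so excluded from the score.
    Mapping("generalized tonic-clonic seizures", "HP:0001250", None),
]

accuracy = mapping_accuracy(batch)
print(f"accuracy on reviewed mappings: {accuracy:.2f}")
```

Scoring only reviewed rows keeps the metric tied to curator judgment; a figure such as the >90% reported in the abstract would correspond to this kind of agreement rate, although the paper's exact evaluation protocol is not described in this summary.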

## Linked entities

- **Genes:** SURF1 (SURF1 cytochrome c oxidase assembly factor) [NCBI Gene 6834], ATP6 (ATP synthase F0 subunit 6) [NCBI Gene 4508], ND3 (NADH dehydrogenase subunit 3) [NCBI Gene 4537]
- **Diseases:** Leigh Syndrome (MONDO:0009723), Leigh Syndrome Spectrum (MONDO:0009723)

## Full-text entities

- **Genes:** SURF1 (SURF1 cytochrome c oxidase assembly factor) [NCBI Gene 6834] {aka CMT4K, MC4DN1, SHY1}, ATP6 (ATP synthase F0 subunit 6) [NCBI Gene 4508] {aka ATPase6, MTATP6}, ND3 (NADH dehydrogenase subunit 3) [NCBI Gene 4537] {aka MTND3}, ND5 (NADH dehydrogenase subunit 5) [NCBI Gene 4540] {aka MTND5}, ECHS1 (enoyl-CoA hydratase, short chain 1) [NCBI Gene 1892] {aka ECHS1D, SCEH, mECH, mECH1}, TRMU (tRNA mitochondrial 2-thiouridylase) [NCBI Gene 55687] {aka LCAL3, MTO2, MTU1, TRMT}, HSD17B10 (hydroxysteroid 17-beta dehydrogenase 10) [NCBI Gene 3028] {aka 17b-HSD10, ABAD, CAMR, DUPXp11.22, ERAB, HADH2}, ND6 (NADH dehydrogenase subunit 6) [NCBI Gene 4541] {aka MTND6}, NDUFAF6 (NADH:ubiquinone oxidoreductase complex assembly factor 6) [NCBI Gene 137682] {aka C8orf38, FRTS5, MC1DN17, lncREST}, NDUFA1 (NADH:ubiquinone oxidoreductase subunit A1) [NCBI Gene 4694] {aka CI-MWFE, MC1DN12, MWFE, ZNF183}, PDHA1 (pyruvate dehydrogenase E1 subunit alpha 1) [NCBI Gene 5160] {aka E1alpha, PDHA, PDHAD, PDHCE1A, PHE1A}
- **Diseases:** Abnormality of the eye (MESH:D005124), DIAGNOSIS (MESH:D001523), Lactic acidosis (MESH:D000140), Muscle weakness (MESH:D018908), Neurodevelopmental abnormality (MESH:D063647), Lactate Abnormalities (MESH:D007775), injury to (MESH:D014947), ophthalmoplegia (MESH:D009886), spasticity (MESH:D009128), dystonia (MESH:D004421), DD (MESH:C536170), optic atrophy (MESH:D009896), Mitochondrial Disease (MESH:D028361), hyperreflexia (MESH:D012021), OA (MESH:D010003), abnormality of mitochondrial metabolism (MESH:D008659), Seizures (MESH:D012640), LLMs (MESH:D007806), GenAI (MESH:C538142), respiratory (MESH:D012131), infantile spasms (MESH:D013036), falls (MESH:C537863), Complex V defects (MESH:C564964), nystagmus (MESH:D009759), HPO (MESH:D001734), Complex I (MESH:C537475), cardiomyopathy (MESH:D009202), bilateral (MESH:D006312), neurodevelopmental delay (MESH:D006968), brainstem dysfunction (MESH:D020295), Complex III deficiency (MESH:C565128), Complex I and IV (MESH:D030401), Acidosis (MESH:D000138), ptosis (MESH:C564553), Abnormal muscle tone (MESH:D009122), HP (MESH:C537262), hypotonia (MESH:D009123), activity (OMIM:612348), RF (MESH:C538347), Neuroimaging abnormalities (MESH:D000014), hallucination (MESH:D006212), Death (MESH:D003643), myoclonus (MESH:D009207), rare disease (MESH:D035583), ataxia (MESH:D001259), Abnormality of brain morphology (MESH:D001927), tremor (MESH:D014202), abnormal cerebral white matter (MESH:D014402), developmental delay (MESH:D002658), mitochondrial ATP synthase Complex deficiency (OMIM:604273), Neuromuscular and Movement Abnormalities (MESH:D009468), LS (MESH:D007888), X-linked cases (MESH:C536424), Movement disorders (MESH:D009069), Abnormal basal ganglia morphology (MESH:D001480), Complex II deficiency (MESH:C565375), hypertrichosis (MESH:D006983), PDHc enzyme deficiency (MESH:D008661), MOI (MESH:C537734), Abnormality of movement (MESH:D004409)
- **Chemicals:** lactate (MESH:D019344), coenzyme Q10 (MESH:C024989), Biotin (MESH:D001710), Thiamine (MESH:D013831), Gemini (-)
- **Species:** Homo sapiens (human, species) [taxon 9606], Pseudospironympha oblonga (species) [taxon 2971532]
- **Mutations:** m.10191T>C, m.8993T>G, c.2T>A, m.10197G>A, m.13513G>A

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12937636/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12937636/full.md

## References

54 references — full list in the complete paper: https://tomesphere.com/paper/PMC12937636/full.md

---
Source: https://tomesphere.com/paper/PMC12937636