# Clinically Informed Semi-Supervised Learning Improves Disease Annotation and Equity from Electronic Health Records: A Glaucoma Case Study

**Authors:** Mousa Moradi, Rishi Shah, Asahi Fujita, Niloufar Bineshfar, Daniel M. Vu, Kanza Aziz, Daniel L. Liebman, Saber Kazeminasab Hashemabad, Mengyu Wang, Tobias Elze, Mohammad Eslami, Nazlee Zebardast

PMC · DOI: 10.21203/rs.3.rs-7546650/v1 · Research Square · 2025-10-03

## TL;DR

Ci-SSGAN, a semi-supervised GAN trained on 2.1 million ophthalmology notes, annotates glaucoma in health records more accurately than ICD-based labels (0.95 vs. 0.85 AUROC) and narrows performance gaps for Black patients, women, and younger patients.

## Contribution

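Introduces Ci-SSGAN (Clinically Informed Semi-Supervised Generative Adversarial Network), a framework that uses large-scale unlabeled clinical text and demographic conditioning to reannotate patient conditions with improved accuracy and equity while reducing reliance on expert annotations.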

## Key findings

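- Trained on 2.1 million ophthalmology notes, Ci-SSGAN achieved 0.85 accuracy and 0.95 AUROC in glaucoma annotation.
- Improved AUROC by 10.19% compared to ICD-based labels (0.74 accuracy, 0.85 AUROC).
- Narrowed subgroup performance gaps, with F1 gains for Black patients (+0.05), women (+0.06), and younger patients (+0.033).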
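
For readers less familiar with the two headline metrics, the following is a minimal sketch of how accuracy and AUROC can be computed from model scores. It assumes a binary label (1 = glaucoma, 0 = no glaucoma) and a 0.5 decision threshold, both of which are illustrative assumptions and not details taken from the paper; the function names and toy data are hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random
    negative, with ties counted as half a win (pairwise formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label.
    The 0.5 threshold is an assumption made for this illustration."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Hypothetical toy data, not from the study.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

toy_auroc = auroc(toy_labels, toy_scores)        # 8 of 9 positive/negative pairs ranked correctly
toy_accuracy = accuracy(toy_labels, toy_scores)  # 4 of 6 examples classified correctly
```

AUROC depends only on how the scores rank positives against negatives, so it can stay high even when a fixed threshold misclassifies some cases, which is one reason the two numbers are reported together.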

## Abstract

Clinical notes represent a vast but underutilized source of information for disease characterization, whereas structured electronic health record (EHR) data such as ICD codes are often noisy, incomplete, and too coarse to capture clinical complexity. These limitations constrain the accuracy of datasets used to investigate disease pathogenesis and progression and to develop robust artificial intelligence (AI) systems. To address this challenge, we introduce Ci-SSGAN (Clinically Informed Semi-Supervised Generative Adversarial Network), a novel framework that leverages large-scale unlabeled clinical text to reannotate patient conditions with improved accuracy and equity. As a case study, we applied Ci-SSGAN to glaucoma, a leading cause of irreversible blindness characterized by pronounced racial and ethnic disparities. Trained on 2.1 million ophthalmology notes, Ci-SSGAN achieved 0.85 accuracy and 0.95 AUROC, representing a 10.19% AUROC improvement compared to ICD-based labels (0.74 accuracy, 0.85 AUROC). Ci-SSGAN also narrowed subgroup performance gaps, with F1 gains for Black patients (+0.05), women (+0.06), and younger patients (+0.033). By integrating semi-supervised learning and demographic conditioning, Ci-SSGAN minimizes reliance on expert annotations, making AI development more accessible to resource-constrained healthcare systems.
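
The subgroup results above are expressed as F1 scores per demographic group. As a rough illustration of how such an equity check might be set up, the sketch below computes F1 separately for each group and reports the spread between the best and worst group. This is not the authors' evaluation code: the binary labels, the group names `A` and `B`, the helper names, and the toy data are all hypothetical.

```python
from collections import defaultdict


def f1_score(labels, preds):
    """Binary F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_by_group(labels, preds, groups):
    """Split examples by demographic group and compute F1 within each."""
    buckets = defaultdict(lambda: ([], []))
    for y, p, g in zip(labels, preds, groups):
        buckets[g][0].append(y)
        buckets[g][1].append(p)
    return {g: f1_score(ys, ps) for g, (ys, ps) in buckets.items()}


def f1_gap(per_group):
    """Spread between the best- and worst-served group."""
    return max(per_group.values()) - min(per_group.values())


# Hypothetical toy data: group A is labeled perfectly, group B is not.
labels = [1, 1, 0, 0, 1, 1, 0, 0]
preds = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

per_group = f1_by_group(labels, preds, groups)
gap = f1_gap(per_group)
```

Running the same comparison for two labeling methods (for example, ICD-based labels and model-generated labels) shows whether one of them raises F1 for the groups that were previously served worst.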

## Linked entities

- **Diseases:** glaucoma (MONDO:0005041)

## Full-text entities

- **Diseases:** Glaucoma (MESH:D005901), blindness (MESH:D001766)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12622191/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12622191/full.md

## References

48 references — full list in the complete paper: https://tomesphere.com/paper/PMC12622191/full.md

---
Source: https://tomesphere.com/paper/PMC12622191