AID-FGS: Artificial intelligence-enabled diagnosis of female genital schistosomiasis: Preliminary findings

Akanksha Sharma; Tanmoy Dam; Sepo Mwangelwa; Chishiba Kabengele; William Kilembe; Bellington Vwalika; Mubiana Inambao; W. Evan Secor; Rachel Parker; Tyronza Skarkey; Susan Allen; Anant Madabhushi; Kristin M. Wall

PMC · DOI: 10.1371/journal.pdig.0001255 · February 20, 2026

Open Access

TL;DR

This study explores using AI to diagnose female genital schistosomiasis (FGS), a parasitic disease that affects women in sub-Saharan Africa and is associated with increased HIV risk. Better diagnosis could improve access to care.

Contribution

The study introduces an AI model for diagnosing female genital schistosomiasis from cervical images, showing promising accuracy.

Findings

1. The AI model achieved an AUC of 0.70 in detecting FGS from cervical images.

2. Women with higher FGS severity scores were more often correctly classified by the model (75% vs. 69% for lower severity scores).

3. Machine learning shows potential for improving FGS diagnosis in resource-limited settings.

Abstract

Female genital schistosomiasis (FGS) is a sequela of infection with a waterborne parasite prevalent in sub-Saharan Africa and is associated with increased HIV risk. Diagnosis of FGS involves visual colposcopic identification of lesions on the cervix or vaginal walls. Previous studies have utilized digital image processing methods with statistical validation, and more recently, an artificial intelligence (AI)-based approach has also been explored. In this work, we sought to evaluate the performance of an AI model for identifying the presence of FGS from cervical photographs. Colposcopy images were obtained from 340 subjects in Zambia. Ground truth for presence or absence of FGS was determined by trained expert human examiners using visual assessment of images. Examiners also provided an FGS severity score between 0–8 for each image based on the number of lesions and the cervical quadrants…

Funding

- National Cancer Institute (http://dx.doi.org/10.13039/100000054)
- National Heart, Lung, and Blood Institute (http://dx.doi.org/10.13039/100000050)
- National Institute of Biomedical Imaging and Bioengineering (http://dx.doi.org/10.13039/100000070)
- VA Merit Review Award
- Bill and Melinda Gates Foundation (http://dx.doi.org/10.13039/100000865)
- Lung Cancer Research Foundation (http://dx.doi.org/10.13039/100007038)
- DOD Peer Reviewed Cancer Research Program (http://dx.doi.org/10.13039/100014042)
- Kidney Precision Medicine Project
- Breast Cancer Research Foundation (http://dx.doi.org/10.13039/100001006)
- DOD Prostate Cancer Research Program (http://dx.doi.org/10.13039/100014039)
- U.S. Department of Veterans Affairs (http://dx.doi.org/10.13039/100000738)

Taxonomy

Topics: Parasites and Host Interactions · Cervical Cancer and HPV Research · Artificial Intelligence in Healthcare and Education

Full text

Introduction

Infection with Schistosoma haematobium, a parasitic worm found primarily in sub-Saharan Africa, can result from direct contact with contaminated fresh water [1]. Infection can cause female genital schistosomiasis (FGS) which affects an estimated 56 million women and girls in sub-Saharan Africa and is one of the most neglected tropical diseases globally [1]. In endemic countries, women and girls often encounter S. haematobium-contaminated water during daily chores and activities [1]. Schistosomiasis infection in humans begins when larval forms of the parasite, released by infected freshwater snails, penetrate the skin during contact with contaminated water. Transmission is sustained when infected individuals release parasite eggs into freshwater through their urine or feces. These eggs hatch in the water, continuing the parasite’s lifecycle. Once inside the human body, the larvae mature into adult schistosomes that reside in blood vessels. Female worms release eggs, some of which exit the body to infect new water sources, while others become lodged in tissues, triggering immune responses and causing progressive organ damage [2].

FGS is characterized by lesions on the reproductive organs, including the cervix, which may cause reproductive organ damage, subfertility, pregnancy complications, lost productivity, stigma [3], cervical dysplasia [4,5], and increased HIV risk [6,7].

FGS remains a largely neglected manifestation of schistosomiasis that is chronically underreported, misdiagnosed, and untreated by the global health community [1,8]. The current clinical standard for FGS diagnosis is visual assessment of the cervix and vaginal walls by trained clinicians, either during colposcopy or from images of the cervix taken during colposcopy. Clinicians inspect the cervix and vaginal walls for four characteristic FGS lesions: grainy sandy patches, homogeneous yellow sandy patches, abnormal blood vessels, or rubbery papules [9]. Challenges with this method in low- and middle-income countries (LMICs) include a dearth of providers trained to identify FGS, subjective and inconsistent assessments among providers, and limited colposcopy equipment [8]. FGS shares symptoms with sexually transmitted infections (STIs), which can contribute to its misdiagnosis [9] and associated stigma [10,11]. As a result, for women living in S. haematobium-endemic areas, FGS remains highly prevalent and under-diagnosed [1].

Advancements in FGS diagnostic methods are urgently needed. The limited literature in this domain includes computerized cervical image analysis approaches such as colorimetric analysis [12,13], characterization of blood vessels [14], use of a grid to validate cervical lesion proportion [15], and recent work using the YOLO deep learning model [16]. We undertook this study to evaluate the potential of deep learning-based AI methods to identify FGS using colposcopy images from 340 study participants (also referred to as subjects) from Zambia. The ground truth for the presence of FGS was established by expert obstetrician/gynecologists (OB/GYNs) using visual assessment as per the recommended clinical standard [9]. We also assessed the performance of the AI method (AID-FGS) with respect to the severity scores, as well as the impact of image harmonization methods on the AI model's performance. The methodology for developing the AI/machine learning (ML) tool included data acquisition; preprocessing by enclosing the image in a bounding box followed by specular reflection removal; learning the optimal feature representation; and classification to identify images with FGS. Fig 1 illustrates the block diagram of the methodology and the study design used to develop the AI/ML algorithm.

Fig 1. Overview diagram for the development of the artificial intelligence/machine learning (AI/ML) tool. A) Colposcopy image acquisition. B) Images were enclosed within a rectangular bounding box to remove unnecessary details while including the cervical ostium and transformation zone. The bounded images were processed to remove specular reflections. C) The ensemble model, comprising pretrained DinoV2, EfficientNet and ResNet, was fine-tuned and tested on randomly divided training and test sets. D) Results were computed using area under the curve (AUC), sensitivity, specificity, accuracy and F1 score.

Materials and methods

Study design and sampling

This is a cross-sectional study nested within a prospective cohort. Members of the cross-sectional study were recruited from the cohort through convenience sampling. Center for Family Health Research Zambia (CFHRZ) nurse counselors approached participants who were waiting to attend their annual or quarterly cohort study visits at the Lusaka or Ndola CFHRZ research sites. Women were provided with a description of this study, and those who were interested and provided written informed consent were enrolled by a CFHRZ nurse counselor.

Ethics approval and consent to participate

This study was approved by the Institutional Review Board of Emory University and the University of Zambia Biomedical Research Committee. Participants provided written informed consent.

Participant selection

From March 2020 to December 2021, we recruited 499 women from an existing prospective cohort of women at high risk for HIV in two large urban areas in Zambia (Lusaka and Ndola) [17]. Women in the cohort were either female sex workers (FSW) or single mothers who were at least 18 years of age. Women were referred to the CFHRZ research sites after community outreach at sex worker hotspots or post-natal clinics, respectively.

Survey procedures

At CFHRZ research sites, participants completed baseline surveys in the local language (Nyanja in Lusaka and Bemba in Ndola) to assess demographics; reproductive, gynecological, and urinary history and symptoms; and potential environmental exposures to S. haematobium.

Clinical procedures

Colposcopic exams were conducted on non-menstruating women. First, an autoclaved bivalve speculum was inserted, and a genital examination assessed inflammation, contact bleeding, discharge, ulceration, and adenopathy. Then, the cervix was cleared of any discharge, endocervical and vaginal swabs were collected, and finally a Bovie Colpo-Master CS-105LEDI Swing Arm Colposcope equipped with continuous zoom (zoom ratio 1:6.7, 0.67x–4.5x; magnification 3.9x–27x) was used to take images of the cervix. Images were taken both before (PRE images) and after a cervical wash with acetic acid (POST images) for visual inspection with acetic acid (VIA). Our expert OB/GYN trained four CFHRZ research doctors and nurses to perform these procedures. Additionally, women provided urine samples for hematuria testing and for urine filtration to detect S. haematobium eggs, performed by trained laboratory technicians. Participants were also tested for gonorrhea, chlamydia, acute HIV infection, and high-risk human papillomavirus (hrHPV) by PCR; trichomoniasis, candida, and bacterial vaginosis by microscopy; syphilis by Rapid Plasma Reagin (RPR); and HIV by rapid test. All testing was conducted by trained laboratory technicians as previously described [17].

Clinical FGS diagnosis (ground truth)

Images taken during colposcopy were downloaded onto a computer for storage. Each image was independently reviewed by one OB/GYN (either MI or BV; both are experts in FGS identification, and the latter is a co-author of the World Health Organization FGS Pocket Atlas [9]) applying a standard FGS case definition, i.e., presence of any FGS indicator: grainy sandy patches, homogeneous yellow sandy patches, abnormal blood vessels, or rubbery papules [9]. The reviewer recorded whether any of these indicators was present and the cervical quadrant in which each indicator was located. A severity score of 0–8 was then assigned to each participant based on the total number of FGS indicators observed and the number of cervical quadrants involved [18]. A score of 0 represents no FGS, while 8 represents the highest FGS severity. A higher severity score indicates more involved quadrants of the cervix and/or more FGS lesion types. For example, a score of 3 might reflect a case where the infection is spreading but does not yet involve the entire cervix, whereas a score of 4 indicates a more advanced or aggressive presentation, due either to broader anatomical involvement or to more pronounced diagnostic features. As there are no established severity metrics for FGS, we derived this severity score and previously reported that increased severity was associated with decreased lesion resolution post-treatment [18], indicating its clinical utility. To reduce bias, the OB/GYN was blinded to gynecological exam, laboratory, and survey findings at the time of image review.

Treatments and referrals

Women with any FGS indicator, egg excretion, or hematuria were treated for free at CFHRZ research sites with praziquantel (40 mg/kg). Women diagnosed with STIs or vaginal dysbioses were treated for free at CFHRZ per Zambian National Guidelines [19]. Women who were HIV, VIA, or hrHPV positive were referred for care to a local government health facility.

Data collection and management

Survey data were collected on tablets using SurveyCTO (Dobility, Inc., v2.81.4). Clinical and laboratory data were collected on paper forms and later entered into SurveyCTO. Data were imported weekly from SurveyCTO into MS Access for long-term storage, quality control, and cleaning.

Analysis dataset

Fig 2 shows the block diagram for the inclusion and exclusion of subjects in the AI/ML analytic dataset. In total, 499 women were recruited for the study, of whom 105 had missing images. Exclusion criteria were blurred images, a cervical ostium that was not visible, or a missing enlarged view of the cervix. Inclusion criteria were available ground truth, good image quality, and limited loss of image area after removal of specular reflections. Applying these criteria, we excluded a further 54 participants and included 340 participants in the study: 92 women with FGS and 248 women who did not have FGS (referred to as no FGS).
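The participant flow can be reconciled with simple arithmetic. The sketch below only restates the counts reported above and checks that they add up; it introduces no new data.

```python
# Participant flow, using the counts reported in the Analysis dataset paragraph.
recruited = 499
missing_images = 105
excluded_after_screening = 54  # image-quality and ground-truth criteria

included = recruited - missing_images - excluded_after_screening
fgs, no_fgs = 92, 248

# 499 - 105 - 54 = 340, which equals 92 FGS + 248 no FGS.
assert included == fgs + no_fgs == 340
print(f"included={included}, FGS={fgs}, no FGS={no_fgs}, FGS share={fgs / included:.1%}")
```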

Fig 2. Diagram for inclusion and exclusion of participants for artificial intelligence/machine learning (AI/ML) tool development. Specular reflection refers to a prominent artifact in colposcopy images that hinders the analysis of lesions.

Preprocessing

Colposcopy images often capture unnecessary features, such as the speculum. To exclude these details and reduce computation, the colposcopy images (3072 × 1728 pixels) were manually cropped to a rectangular bounding box using the Label Studio software [20,21]. Specular reflections (SR), or highlights, are strong artifacts that sometimes accompany cervix images. SR appear as bright white areas on cervix images, resulting from light reflecting off the cervix's wet surface [22]. These highlights disrupt analysis of the surrounding region and of any lesions at those sites, necessitating removal of these artifacts. SR were identified using the method described in [23], except that the threshold after filtering was adjusted to a grayscale value of 0.05 for our dataset. Fig 3(A) illustrates the method used for detecting highlights in our dataset. The pixels corresponding to the highlighted regions were filled using a pre-trained diffusion model [24,25]. Diffusion models are generative models used in image synthesis and denoising that produce realistic, high-quality images [24]. To retain the original image resolution, a tile-based approach was used to fill the SR regions in the cropped image: the cropped image was divided into tiles of 512 × 512 pixels, and the tiles corrected for highlights were then recombined to obtain the inpainted image. Sample images from our dataset are illustrated in Fig 3(B).
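The tile-based filling strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the AID-FGS code: `process_in_tiles` and `mean_fill` are hypothetical names, partial tiles at the image border are assumed to be handled by edge padding (the text does not say how), and a toy mean-colour fill stands in for the pre-trained diffusion model. SR detection itself is not reproduced here; the mask is taken as given.

```python
import numpy as np

TILE = 512  # tile edge length in pixels, as stated in the text


def process_in_tiles(image, mask, fill_fn, tile=TILE):
    """Fill masked pixels tile by tile and reassemble at the original resolution.

    image:   H x W x 3 uint8 array (cropped colposcopy image)
    mask:    H x W bool array, True where a specular reflection was detected
    fill_fn: callable (tile_image, tile_mask) -> filled tile of the same shape;
             in the paper this role is played by a pre-trained diffusion model
    """
    h, w = mask.shape
    # Assumption: pad to a multiple of the tile size so edge tiles are full tiles.
    pad_h, pad_w = (-h) % tile, (-w) % tile
    img_p = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), mode="edge")
    mask_p = np.pad(mask, ((0, pad_h), (0, pad_w)), mode="constant")
    out = img_p.copy()
    for y in range(0, img_p.shape[0], tile):
        for x in range(0, img_p.shape[1], tile):
            m = mask_p[y:y + tile, x:x + tile]
            if m.any():  # tiles without highlights are left untouched
                out[y:y + tile, x:x + tile] = fill_fn(img_p[y:y + tile, x:x + tile], m)
    return out[:h, :w]  # drop the padding


def mean_fill(tile_image, tile_mask):
    """Hypothetical toy stand-in for the diffusion inpainter: replace masked
    pixels with the mean colour of the unmasked pixels in the same tile."""
    filled = tile_image.copy()
    if (~tile_mask).any():
        filled[tile_mask] = tile_image[~tile_mask].mean(axis=0).astype(tile_image.dtype)
    return filled


# Usage on a synthetic image with an artificial highlight.
rng = np.random.default_rng(0)
img = rng.integers(0, 200, size=(1000, 700, 3), dtype=np.uint8)
sr_mask = np.zeros((1000, 700), dtype=bool)
sr_mask[100:120, 600:650] = True
img[sr_mask] = 255  # saturated "highlight" pixels
restored = process_in_tiles(img, sr_mask, mean_fill)
```

Working tile by tile keeps each call to the inpainter at a fixed input size while the reassembled output keeps the full resolution of the cropped image. That is the motivation the text gives for the 512 × 512 tiling.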

Fig 3. Diagram for detecting specular reflections (SR) on colposcopy images. (A) The S, G, and L components used to compute F correspond to the saturation (S), green (G) and luminance (L) channels, obtained from the HSV, RGB and CIE-Lab color spaces, respectively. (B) SR removal for four FGS-positive women, $S_i$ ($i = 1, 2, 3, 4$). The first row shows the bounded image, the second row shows the highlights (black pixels) detected using the method defined in (A), and the third row shows the images with SR removed using diffusion modeling.

Machine learning

The newly presented AID-FGS machine learning model for classifying FGS versus no FGS comprises an ensemble of pre-trained models: DinoV2 [26], EfficientNetB0 [27] and ResNet18 [28]. DinoV2 was selected as a representative transformer-based approach because it learns robust visual representations through self-supervised learning without requiring large labeled datasets [26]. DinoV2 has demonstrated strong transfer learning capabilities across diverse visual recognition tasks, including medical imaging applications [29,30]. EfficientNetB0 was chosen for its favorable balance between accuracy and computational efficiency; it offers a practical advantage over larger, more computationally demanding models and has shown its potential in transfer learning on medical images [27,31]. A recent review of cervical cancer detection algorithms potentially applicable to FGS detection identified ResNet as one of the most frequently used deep learning models in this domain [8], supporting its relevance for our application. We employed transfer learning [32] to fine-tune and optimize these models. Transfer learning leverages knowledge gained in a source domain to improve learning efficiency and performance in a target domain [32]. Details of the transfer learning approach used in AID-FGS are described in Section A in S1 Text, and training details and parameters are listed in Section B in S1 Text. In addition to the ensemble model, DinoV2, EfficientNetB0, and ResNet18 were each trained in the same paradigm to evaluate the efficacy of each model for FGS detection. Integrated gradient (IG) maps [33] were obtained to understand the learning patterns of the models. For each model, IG produces a heatmap by accumulating the gradients of the model's output with respect to the input along a path from a baseline image to the actual image, highlighting which input features were most influential in the model's decision.
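For readers unfamiliar with integrated gradients, the sketch below computes IG for a toy logistic model. Because the input gradient of that model has a closed form, no deep learning framework is needed. It is not the AID-FGS pipeline. Applying IG to DinoV2, EfficientNetB0 or ResNet18 requires automatic differentiation; libraries such as Captum provide an `IntegratedGradients` implementation for PyTorch models. The names `toy_model` and `integrated_gradients`, the all-zero baseline and the 200 integration steps are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def toy_model(x, w, b):
    """Hypothetical stand-in classifier: FGS probability for an input vector x."""
    return sigmoid(x @ w + b)


def toy_model_grad(x, w, b):
    """Analytic gradient of toy_model with respect to its input x."""
    s = toy_model(x, w, b)
    return s * (1.0 - s) * w


def integrated_gradients(x, baseline, w, b, steps=200):
    """Midpoint Riemann approximation of integrated gradients along the straight
    path from `baseline` to `x`; returns one attribution per input feature."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([toy_model_grad(baseline + a * (x - baseline), w, b) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)


rng = np.random.default_rng(1)
w, b = rng.normal(size=8), -0.2
x = rng.normal(size=8)        # e.g. 8 "pixels" of a flattened image
baseline = np.zeros(8)        # an all-black image is a common baseline choice
attributions = integrated_gradients(x, baseline, w, b)

# Completeness: attributions sum to F(x) - F(baseline), up to discretisation error.
print(attributions.sum(), toy_model(x, w, b) - toy_model(baseline, w, b))
```

In an image model the same attributions are computed per pixel and rendered as the heatmaps shown in the IG figures. The completeness check is a convenient sanity test that the number of integration steps is sufficient.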
The t-distributed stochastic neighbor embedding (t-SNE) is a dimensionality reduction method that helps visualize high-dimensional data [34]. It maps the data to a 2D or 3D space in a way that preserves local neighborhood structure, revealing underlying patterns. In our study, t-SNE was used to plot high-dimensional embeddings obtained from the ML models for FGS-positive participants in the test set ($S_v$) to observe their grouping with respect to severity scores.

AI methods for detecting cervical cancer

A recent extensive review [8] explored whether AI algorithms designed for cervical cancer/dysplasia detection using colposcopy images could be applied to FGS detection. The methodology described in this review prompted us to explore the handcrafted features described by Xu et al. [35] for FGS detection. For each image, we extracted three complementary pyramid features: pyramid histogram in L*a*b* color space (PLAB), pyramid histogram of oriented gradients (PHOG), and pyramid histogram of local binary patterns (PLBP). Support vector machines with linear and radial basis function (RBF) kernels were used to evaluate the performance of the handcrafted features after collinear features had been removed; collinear features were identified as those with a Spearman correlation coefficient greater than or equal to 0.6. The performance of these features was also evaluated on a similar holdout test set.
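The collinearity filter can be sketched as below. It rests on two assumptions the text does not settle. First, the 0.6 threshold is applied to the absolute value of Spearman's rho. Second, pairs are resolved greedily by keeping the earlier column. `drop_collinear` and the other helpers are hypothetical names. The retained columns would then be passed to the SVMs, for example scikit-learn's `SVC(kernel="linear")` or `SVC(kernel="rbf")`.

```python
import numpy as np


def average_ranks(v):
    """1-based ranks of v, with tied values sharing their average rank."""
    order = np.argsort(v, kind="mergesort")
    ranks = np.empty(len(v), dtype=float)
    ranks[order] = np.arange(1, len(v) + 1)
    _, inverse, counts = np.unique(v, return_inverse=True, return_counts=True)
    return (np.bincount(inverse, weights=ranks) / counts)[inverse]


def spearman_matrix(X):
    """Spearman correlation between the columns of X (n_samples x n_features),
    computed as the Pearson correlation of the column ranks."""
    R = np.apply_along_axis(average_ranks, 0, X)
    rho = np.corrcoef(R, rowvar=False)
    return np.nan_to_num(rho, nan=0.0)  # constant columns: treat as uncorrelated


def drop_collinear(X, threshold=0.6):
    """Greedy filter (assumed): keep column j only if |rho| < threshold with every
    column already kept. Returns the indices of the retained columns."""
    rho = np.abs(spearman_matrix(X))
    kept = []
    for j in range(X.shape[1]):
        if all(rho[j, k] < threshold for k in kept):
            kept.append(j)
    return kept


rng = np.random.default_rng(2)
a = rng.normal(size=200)
X = np.column_stack([
    a,                                  # feature 0
    a ** 3,                             # monotone in a, so Spearman rho = 1
    rng.normal(size=200),               # independent feature
    -a + 0.01 * rng.normal(size=200),   # almost perfectly anti-correlated with a
])
print(drop_collinear(X))  # [0, 2]
```

Spearman's rho is used rather than Pearson's r because it captures any monotone relationship, such as `a` and `a ** 3` above, which is a common redundancy among histogram-type features.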

Statistical analysis

Due to the limited number of FGS-positive participants in the dataset, we divided the samples into a balanced training set (FGS = 71, no FGS = 71) and an imbalanced test set ($S_v$; FGS = 21, no FGS = 177), the latter reflecting real-world clinical scenarios [36]. Sensitivity, specificity, accuracy, area under the curve (AUC), the 95% confidence interval (CI) of the AUC, and F1 score were computed to assess the performance of each model. The gold standard diagnosis (true label) was taken from the expert OB/GYN. Sensitivity and specificity were computed from the labels predicted by the deep learning model as follows:

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

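$$\text{Specificity} = \frac{TN}{TN + FP}$$

where TP, FN, TN and FP denote the numbers of true positives, false negatives, true negatives and false positives, respectively, with respect to the OB/GYN ground truth.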
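The sampling scheme and the two formulas above can be sketched as follows. The function names and the random seed are hypothetical. The paper states only that the division was random, that the training set was balanced at 71 per class, and that the remaining participants formed the test set.

```python
import numpy as np


def balanced_train_split(labels, n_per_class, seed=0):
    """Randomly draw n_per_class positives and negatives for training;
    all remaining participants form the (imbalanced) test set."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    pos = rng.permutation(np.flatnonzero(labels == 1))
    neg = rng.permutation(np.flatnonzero(labels == 0))
    train = np.concatenate([pos[:n_per_class], neg[:n_per_class]])
    test = np.concatenate([pos[n_per_class:], neg[n_per_class:]])
    return train, test


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN), specificity = TN / (TN + FP); label 1 = FGS."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)


labels = np.array([1] * 92 + [0] * 248)  # 92 FGS, 248 no FGS, as reported
train_idx, test_idx = balanced_train_split(labels, n_per_class=71)
print(len(train_idx), labels[test_idx].sum(), (labels[test_idx] == 0).sum())  # 142 21 177

# Toy labels: 2 of 3 positives and 4 of 5 negatives are predicted correctly.
sens, spec = sensitivity_specificity([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
print(sens, spec)
```

The split is drawn at the participant level, so no participant contributes images to both sets.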

The F1 score is the harmonic mean of precision and recall and is a suitable metric for estimating performance under an imbalanced class distribution [37]. AUC values are interpreted as follows: AUC = 0.5, no discrimination; 0.5 < AUC < 0.7, poor discrimination; 0.7 ≤ AUC < 0.8, acceptable discrimination; 0.8 ≤ AUC < 0.9, excellent discrimination; AUC ≥ 0.9, outstanding discrimination [38]. As with accuracy, higher sensitivity and specificity indicate better performance. We continuously evaluated results on $S_v$ during the training phase of the model. The trained model was finalized at the iteration with the maximum F1 score on $S_v$ among the iterations for which both sensitivity and specificity exceeded 0.6. If the model failed to reach this cut-off of 0.6 for sensitivity and specificity, the iteration with the highest F1 score was used. DeLong's method [39] was used to compute the CI of the AUC. The scikit-learn Python library [40] and PyTorch [41] were used to implement the algorithm.
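The checkpoint-selection rule above can be written as a small function. This is a sketch: `select_checkpoint` and the record layout are hypothetical, and the evaluation log in the usage example is invented for illustration. It does not contain values from this study.

```python
def select_checkpoint(history, min_rate=0.6):
    """Choose the training iteration to keep.

    history: one dict per evaluation during training, with keys
             'iteration', 'f1', 'sensitivity' and 'specificity'.
    Rule: highest F1 among iterations where sensitivity and specificity both
    exceed min_rate; if no iteration qualifies, highest F1 overall.
    """
    eligible = [h for h in history
                if h["sensitivity"] > min_rate and h["specificity"] > min_rate]
    pool = eligible or history  # fall back to plain maximum F1
    return max(pool, key=lambda h: h["f1"])


# Hypothetical evaluation log (illustrative numbers only).
history = [
    {"iteration": 10, "f1": 0.40, "sensitivity": 0.90, "specificity": 0.45},
    {"iteration": 20, "f1": 0.35, "sensitivity": 0.70, "specificity": 0.66},
    {"iteration": 30, "f1": 0.30, "sensitivity": 0.65, "specificity": 0.62},
]
# Iteration 10 has the highest F1 but fails the specificity constraint.
print(select_checkpoint(history)["iteration"])
```

The constraint prevents the F1 criterion from selecting a degenerate operating point, such as one that flags almost every image as FGS and so achieves high sensitivity at very low specificity.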

Results

Patient characteristics

The distribution of the four typical FGS lesions is described in Table 1. Abnormal blood vessels were the most common FGS lesion type among participants, while rubbery papules were the least frequent. Table 2 shows the baseline demographic and clinical examination characteristics of the subjects.

Table 1: Female genital schistosomiasis (FGS) lesion distribution.

Table 2: Baseline characteristics of subjects with respect to demographics and clinical examination.

Performance of ensemble model

The best participant-level performance on the test set ($S_v$) was AUC = 0.70 (95% CI: 0.58–0.82), sensitivity = 0.71, specificity = 0.68, and F1 score = 0.33. The performance metrics on the balanced training set were AUC = 0.65 (95% CI: 0.56–0.74), sensitivity = 0.58, specificity = 0.72, and F1 = 0.62. Fig 4 shows the IG maps of correctly classified and misclassified FGS and no FGS participants. A yellow patch is visible in the colposcopy image of the FGS subject (red bounding box). The corresponding pixels in the IG map show high gradient values in the lesion region, suggesting that the model is capturing FGS attributes. The performance of the other established deep learning models is shown in Table 3, and the ROC curves of all deep learning models are shown in Fig 1(D). The ROC curves suggest that the ensemble model performs better than the other models at low false positive rates. Fig 5(A) shows the agreement heatmap of the models with each other, Fig 5(B) shows the improvement rate of the ensemble model over the individual models, and Fig 5(C) shows a radar plot of all the metrics for each model. EfficientNet and the ensemble model perform comparably on all metrics except F1 score. Conversely, ResNet18 attains the highest sensitivity but performs worse on the other metrics.
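The large gap between the two F1 scores follows from class balance rather than from a change in sensitivity or specificity. The sketch below recomputes the expected F1 from the reported, rounded sensitivity and specificity and from the class sizes of the imbalanced test set (21 FGS, 177 no FGS) and a balanced 71/71 set. `f1_from_rates` is a hypothetical helper. Small differences from the reported values are due to rounding of the input rates.

```python
def f1_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Expected F1 = 2TP / (2TP + FP + FN) for given rates and class sizes."""
    tp = sensitivity * n_pos
    fn = n_pos - tp
    fp = (1.0 - specificity) * n_neg
    return 2 * tp / (2 * tp + fp + fn)


test_f1 = f1_from_rates(0.71, 0.68, n_pos=21, n_neg=177)   # imbalanced test set
train_f1 = f1_from_rates(0.58, 0.72, n_pos=71, n_neg=71)   # balanced set
print(round(test_f1, 2), round(train_f1, 2))  # 0.32 0.62 (reported: 0.33 and 0.62)
```

With a prevalence of about 11% in the test set, a specificity of 0.68 still yields roughly 57 false positives against about 15 true positives. Precision is therefore low, and F1 drops accordingly.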

Table 3: Performance of different deep learning models.

Fig 4. Integrated gradient maps and colposcopy images of true positive (correctly classified as female genital schistosomiasis (FGS)), true negative (correctly classified as no FGS), false positive (misclassified as FGS), and false negative (misclassified as no FGS) women. The red bounding box in the True Positive column shows a yellow sandy patch, a characteristic FGS lesion. The attribution maps show high gradient values in the corresponding region.

Fig 5. Rationale for the model ensemble. A: Heatmap of the agreement rate between the three individual models. B: Bar plot showing the relative improvement of the ensemble model over the three individual models. C: Radar plot illustrating the performance characteristics of all the models. D: Predictions for three FGS study participants by the different models, compared with the obstetrician/gynecologist (OB/GYN) classifications (ground truth). E: Predictions for five women without FGS by the different models, along with ground truth.

Impact of preprocessing

We performed an ablation study to analyze the effect of SR removal and cropping on the ensemble model's performance, evaluating four image pre-processing scenarios (P1–P4) on the dataset. P1 refers to the preprocessing steps used in AID-FGS (i.e., images cropped to the region of interest, with SR regions inpainted). P2 refers to no preprocessing (the original images captured during colposcopy, i.e., uncropped and without inpainting to remove SR). P3 refers to images cropped to the region of interest but not inpainted to remove SR. P4 refers to uncropped images that were inpainted to remove SR. Table 4 compares the performance measures for these approaches. The original images (P2) fail to provide any signal to capture FGS. Cropping to the region of interest (P3) provides a relative improvement of 16.66% in AUC over P2 but is still inferior to AID-FGS (P1) because of the remaining artifacts. The lower performance of P4 compared to P1 indicates that cropping plays a pivotal role. Fig 6 shows the IG heatmaps for a woman with FGS under these four approaches: the model focuses on the region of interest in AID-FGS (panel P1) instead of the background region (speculum, wall lining) in panel P2.
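Relative improvement, as used in this ablation, is taken here to mean the AUC gain divided by the reference AUC. This is an assumed definition; the text does not give the formula. The AUC values in the example are purely illustrative and are not taken from Table 4.

```python
def relative_improvement(new, reference):
    """Relative improvement of `new` over `reference`, in percent (assumed definition)."""
    return 100.0 * (new - reference) / reference


# Illustrative AUC values only: going from 0.48 to 0.56 is a ratio of 7/6,
# i.e. a relative improvement of about 16.67%.
print(round(relative_improvement(0.56, 0.48), 2))
```

Under this definition, a reported value of 16.66% corresponds to an AUC ratio of about 7/6 between P3 and P2, whatever their absolute values.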

Table 4: Effect of image harmonization methods on the ensemble model's performance.

Fig 6. Image harmonization effect on the ensemble model's performance. The first row shows colposcopy images corresponding to the four pre-processing cases for women with FGS: P1, cropped + inpainted; P2, uncropped + not inpainted; P3, cropped + not inpainted; P4, uncropped + inpainted. The second row shows the corresponding IG maps for the model. The red bounding box in panel P2 shows higher gradient values in background regions without any lesion signature. In panel P1, the gradient highlights are concentrated on the cervix rather than on regions of non-interest.

Performance of the ensemble model with respect to severity scores

We converted the severity scores assigned by the expert OB/GYN into dichotomous categories, grouping original scores of {1,2,3,4} as Score 1 and scores of {5,6,7,8} as Score 2. Additional detail on the grouping of scores is given in Fig A in S1 Text. Fig 7(A) shows the t-SNE plot of EfficientNet embeddings for FGS subjects in $S_v$, grouped by severity score; subjects tend to form clusters according to their scores. Fig 7(B) shows a bar plot of correctly classified and misclassified subjects by score. For the Score 1 group, the true prediction rate was 69.23%, while for the Score 2 group it was 75%. Fig 7(C–F) illustrates sample images of participants classified and misclassified in each score category.
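The grouping and per-group rates can be sketched as follows. The two reported rates are consistent with 9 of 13 (69.23%) and 6 of 8 (75%) correct. Together these account for the 21 FGS-positive test participants and 15 correct classifications, which matches the reported sensitivity of 0.71 ≈ 15/21. This reconstruction is an inference from the percentages. The individual scores in the example are hypothetical, and so are the function names.

```python
def severity_group(score):
    """Map an OB/GYN severity score (1-8) to the dichotomous groups used here."""
    if 1 <= score <= 4:
        return 1
    if 5 <= score <= 8:
        return 2
    raise ValueError("only FGS-positive scores 1-8 are grouped")


def true_prediction_rate(scores, correct):
    """Fraction of FGS-positive participants correctly classified, per group."""
    totals, hits = {}, {}
    for s, ok in zip(scores, correct):
        g = severity_group(s)
        totals[g] = totals.get(g, 0) + 1
        hits[g] = hits.get(g, 0) + int(ok)
    return {g: hits[g] / totals[g] for g in sorted(totals)}


# Hypothetical scores: 13 participants in group 1 (9 correct), 8 in group 2 (6 correct).
scores = [2] * 13 + [6] * 8
correct = [True] * 9 + [False] * 4 + [True] * 6 + [False] * 2
rates = true_prediction_rate(scores, correct)
print({g: f"{r:.2%}" for g, r in rates.items()})
```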

Fig 7. Severity score analysis of female genital schistosomiasis (FGS) in test data. A. t-SNE plot of EfficientNet embeddings of study participants in $S_v$, grouped by severity score. B. Bar graph of the number of classified/misclassified women by severity score. C. Sample images of correctly classified women with severity score 2. D. Sample images of correctly classified women with severity score 1. E. Sample images of misclassified women with severity score 1. F. Sample images of misclassified women with severity score 2.

Comparison with AI methods for detecting cervical cancer

The handcrafted features achieved a maximum AUC of 0.52 (95% CI: 0.38–0.63), with sensitivity and specificity of 0.42 and 0.58, respectively, using the RBF kernel, i.e., only chance-level performance for FGS detection. These features capture local color, gradient, and texture information of the cervix at different scales, but they did not capture the subtle FGS characteristics in our dataset.

Discussion

In this study, we present an AI-based model (AID-FGS) to diagnose FGS using images of the cervix taken during colposcopy. To the best of our knowledge, this is one of the few AI approaches to address this specific problem. We examined the potential of both ensemble deep learning models and individual models to identify FGS; among all approaches, transfer learning using the ensemble model demonstrated the best performance characteristics. For misclassified FGS subjects, the incorrect predictions could be driven by several factors, including (1) lesions that are not visually prominent, (2) suboptimal image acquisition, such as low illumination or a field of view in which the cervix is not sufficiently enlarged and centered to capture the relevant features, and (3) SR-corrected regions (with high activation values) being incorrectly interpreted as lesions. A previous study that attempted to develop and validate a measure for quantifying cervical lesion proportion in digital images of the cervix [15] found that a prominent source of disagreement between readers could be attributed to differences in the zoom level of the digital images, which resulted in different coverage of the cervix surface by the grid.

The impact of image harmonization methods on the model's performance demonstrates that removing artifacts and cropping images to their region of interest helps the model focus on and capture relevant FGS signatures. In the future, the manual cropping of images (to focus on the cervix and transformation zone) could be automated to minimize manual intervention and make the pipeline more accessible to less experienced interventionalists. In contrast, the study by Zhu et al. [16] did not perform artifact removal or image cropping; instead, their approach involved annotating bounding boxes around specific lesions. Thus, while both studies address FGS detection, they employ distinct strategies: our work emphasizes image-level classification, whereas Zhu et al. focused on lesion localization and detection.
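The two harmonization steps named above, cropping to a region of interest and correcting specular reflections, can be sketched as follows. This is a hypothetical illustration, not the study's pipeline: the crop box is supplied by hand (mirroring the manual cropping described), the saturation threshold of 230 is an assumed value, and replacing flagged pixels with the per-channel median of the remaining pixels is a crude stand-in for a proper inpainting step such as OpenCV's `cv2.inpaint`.

```python
from typing import Tuple

import numpy as np


def crop_roi(image: np.ndarray, box: Tuple[int, int, int, int]) -> np.ndarray:
    """Crop an H x W x 3 image to a (top, left, bottom, right) box; bottom/right are exclusive."""
    top, left, bottom, right = box
    if not (0 <= top < bottom <= image.shape[0] and 0 <= left < right <= image.shape[1]):
        raise ValueError("box lies outside the image")
    return image[top:bottom, left:right].copy()


def fill_specular(image: np.ndarray, threshold: int = 230) -> Tuple[np.ndarray, np.ndarray]:
    """Flag near-saturated pixels (every channel >= threshold) as specular reflection and
    replace them with the per-channel median of the unflagged pixels.
    Returns (corrected image, boolean H x W mask of flagged pixels)."""
    mask = np.all(image >= threshold, axis=-1)
    out = image.copy()
    if mask.any() and (~mask).any():
        fill = np.median(image[~mask], axis=0).astype(image.dtype)
        out[mask] = fill  # broadcast the 3-channel fill value over every flagged pixel
    return out, mask
```

Cropping before correction keeps glare from outside the cervix (for example, on the speculum) from influencing the fill statistics.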

New deep learning models are released frequently, which makes it challenging to decide which models to use. We studied the performance of three established individual models as well as an ensemble of these models. None of the individual models consistently provided correct predictions (Fig 5D and E). For example, for participant S5 (a woman without FGS in Fig 5E), none of the individual models yielded an accurate prediction. ResNet18 frequently predicted FGS in women who did not have FGS but accurately predicted FGS in women with the characteristic lesions. EfficientNet, which is designed to be computationally efficient and often achieves better performance with fewer parameters and lower computational cost [27], performed better than DinoV2 and ResNet18, but still not as well as the ensemble model (Fig 5C). Regarding the ensemble design, our objective in using simple averaging was to maintain stability and reduce overfitting on a small dataset. We acknowledge that more sophisticated ensemble strategies, including optimized weighting schemes such as attention, and the integration of diverse architectures such as Vision Transformers, could further improve performance.
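Simple averaging means each model contributes equally to the final score. The sketch below assumes that each model outputs a probability of FGS per image and that the averaged probability is thresholded at 0.5; both the threshold and the example probabilities are hypothetical. The second image illustrates the behavior described above for ResNet18: a single over-confident model is outvoted by the other two.

```python
from typing import Dict, List, Tuple


def average_ensemble(
    model_probs: Dict[str, List[float]], threshold: float = 0.5
) -> Tuple[List[float], List[int]]:
    """Unweighted mean of each model's P(FGS) per image, then thresholded to a 0/1 label."""
    names = list(model_probs)
    n = len(model_probs[names[0]])
    if any(len(model_probs[m]) != n for m in names):
        raise ValueError("every model must score the same images")
    means: List[float] = []
    labels: List[int] = []
    for i in range(n):
        mean_p = sum(model_probs[m][i] for m in names) / len(names)
        means.append(mean_p)
        labels.append(1 if mean_p >= threshold else 0)
    return means, labels


# Hypothetical per-image probabilities for three images.
probs = {
    "resnet18": [0.9, 0.6, 0.2],
    "efficientnet": [0.7, 0.3, 0.4],
    "dinov2": [0.8, 0.2, 0.1],
}
means, labels = average_ensemble(probs)
```

A weighted variant would replace the plain mean with learned per-model weights, which adds parameters that a small dataset may not support.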

We also analyzed the performance of the ensembled model with respect to the severity score. Our findings indicate that FGS severity affects classification performance: higher-grade lesions were easier for the machine learning models to identify than lower-grade lesions. However, further validation in additional studies is required.
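A severity-stratified analysis like the one summarized in the bar graph of correctly classified and misclassified women amounts to grouping test predictions by severity score (0–8) and counting agreement with ground truth in each group. The sketch below is a minimal illustration with made-up inputs; the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple


def accuracy_by_severity(
    severity: Sequence[int], y_true: Sequence[int], y_pred: Sequence[int]
) -> Dict[int, Tuple[int, int, float]]:
    """Return {severity score: (n_correct, n_total, accuracy)} for every score present."""
    correct: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for s, t, p in zip(severity, y_true, y_pred):
        total[s] += 1
        correct[s] += int(t == p)
    return {s: (correct[s], total[s], correct[s] / total[s]) for s in sorted(total)}
```

With few women at each severity level, per-group accuracies are noisy, which is one reason the trend above calls for further validation.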

Beyond FGS, colposcopy images have been extensively explored for cervical cancer screening. A study by Yuan et al. [42] used two colposcopy images per patient (one acetic acid image and one iodine image), together with age, HPV test result, cytology result, and transformation zone type, as input to a deep learning model. The multimodal system attained a sensitivity, specificity, and accuracy of 85.38%, 82.62%, and 84.10%, respectively, with an AUC of 0.93 on a dataset of 22,396 images. Another study, by Ouh et al. [43], analyzed 9639 Tele-cervicography images using a deep learning-based system for both cervix detection and classification; in a multicenter retrospective study, their system attained a sensitivity and specificity of 98% and 95%, respectively. Further, a recent review by Lu et al. [44] concluded that AI systems achieved superior diagnostic accuracy compared with experienced colposcopists. By contrast, when we applied to our data analysis techniques that had attained satisfactory performance in identifying cervical dysplasia from acetowhite lesions [35], the results did not correspond well with the ground-truth FGS determinations. Comparing cervical cancer AI with FGS AI is essentially an "apples to oranges" comparison. Unlike cervical cancer, which benefits from decades of standardized, large-scale digital screening archives [8,45], FGS research faces significant challenges in data collection due to the lack of specialized training for providers and the hurdles of imaging in endemic, resource-limited settings [46]. Consequently, FGS datasets are inherently smaller [12,15]. Moreover, the distinct pathophysiology of FGS lesions requires specialized models rather than the mere adaptation of cervical cancer algorithms, and our findings underscore the need for FGS-specific diagnostic frameworks.
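The sensitivity, specificity, and accuracy figures quoted above follow the standard confusion-matrix definitions. The sketch below computes them from binary ground truth and thresholded predictions; it also makes visible that accuracy, unlike sensitivity and specificity, depends on the proportion of positives in the test set, which is one reason headline accuracies from large cervical cancer datasets do not transfer directly to smaller, imbalanced FGS datasets. The names are illustrative.

```python
from typing import Dict, Sequence


def _safe_ratio(num: int, den: int) -> float:
    # Return NaN instead of raising when a class is absent from the evaluation set.
    return num / den if den else float("nan")


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), accuracy = (TP+TN)/N."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": _safe_ratio(tp, tp + fn),
        "specificity": _safe_ratio(tn, tn + fp),
        "accuracy": _safe_ratio(tp + tn, len(pairs)),
    }
```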

Previously published FGS diagnostic techniques have been based on both clinical evaluations [47,48] and automated image processing methods with statistical validation. S. haematobium antigen detection [49], egg detection, and molecular diagnostic techniques such as PCR have been used to screen for S. haematobium in vaginal swabs [50–54]. However, these methods indicate only past or current infection with S. haematobium, not the presence of FGS lesions. Differences in age profiles, infection chronicity, and diagnostic modality (molecular vs. lesion/visual) introduce important limitations for interpreting our results and comparing across studies. Many young women may have molecular evidence of active infection (egg/antigen/DNA) but may not yet have developed the characteristic visual lesions or transformation zone changes captured by clinical/colposcopic inspection. For example, studies have found that cervicovaginal DNA or antigen markers can be positive in the absence of visible lesions [45]. On the other hand, older women may have a lower prevalence of egg excretion (active infection) but a higher prevalence of visual lesions (reflecting years of prior infection and tissue change), which makes comparisons across age groups less straightforward. Visual methods capture morphological changes rather than the direct presence of parasites; they may under-detect early disease (in younger women) or overestimate active disease in older women whose lesions persist after the infection has resolved.

In a previous study [36], our research group explored a simple prediction tool using demographics, risk factors, and symptom data from surveys, including strong clinical indicators such as visual inspection of the cervix with acetic acid and hematuria. We aimed to derive a useful risk score for FGS. The derivation cohort included 349 participants, and we used 5-fold iterative resampling for cross-validation. The risk score was tested in a holdout set of 150 participants. The tool attained a sensitivity of 77% for detecting FGS; however, the generalizability of this simple and cost-effective tool has yet to be explored. Our current study builds on this work by using advanced AI methods to directly analyze colposcopy images for FGS lesions. This approach aims to provide an accurate, objective diagnosis that does not rely on symptoms alone or on limited expert availability. In this way, the previous tool helped to highlight the need for better diagnostics, and our AI model is a step toward meeting that need. Together, these efforts are part of a larger goal to improve the detection and treatment of FGS, especially in areas with limited healthcare resources.
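The exact "5-fold iterative resampling" procedure of the earlier risk-score study is not described here. One common reading is stratified 5-fold cross-validation on the derivation cohort, with the holdout set kept aside entirely; the sketch below shows only that fold assignment, under that assumption. Each class is shuffled and dealt round-robin across folds so that every fold keeps roughly the cohort's FGS prevalence. The function name and seed are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def stratified_kfold(
    labels: Sequence[int], k: int = 5, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs with class ratios preserved per fold."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)  # deal this class round-robin across the folds
    splits = []
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        splits.append((train, test))
    return splits
```

Repeating the split with different seeds and averaging the fold metrics gives a resampled estimate that is less sensitive to any single partition.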

Studies of computer-aided FGS lesion diagnosis are sparse. One published study used colorimetric analysis of 30 cervical images with yellow sandy patches that had been diagnosed by a clinician during colposcopy and achieved a sensitivity of 83% [12]. A subsequent colorimetric analysis of yellow sandy patches by the same research group, using almost 700 cases, had a sensitivity of 80.5% [13]. The same group also used morphological analysis of 150 images to identify abnormal blood vessels indicative of FGS, with a sensitivity of 78% [14]. These studies focused on just two of the four indicators of FGS (yellow sandy patches and abnormal blood vessels), and none used modern AI methods, which leaves room for new AI-based, FGS-specific diagnostic algorithms. Another study [15] created and validated a measure called cervical lesion proportion (CLP) to quantify cervical pathology in FGS. The researchers used a digital imaging technique, overlaying a grid of 424 identical squares on high-resolution images of the cervix from 70 women with FGS. In a similar paradigm, a rubbery papule count (RPC) was also computed. The intraclass correlation coefficients for CLP and RPC were 0.94 and 0.88 for inter-rater reliability and 0.90 and 0.80 for intra-rater reliability. More recently, Zhu et al. [16] used an advanced deep learning model, You Only Look Once (YOLO), to diagnose FGS in 125 subjects, with the model trained on 504 subjects. The study performed detection and localization of four key FGS indicators; two images were used per subject, and the final prediction was made by voting. The model attained a sensitivity of 96% and an accuracy of 78%. YOLO models have also been successful in a related study by Maturana et al. [47] for identifying Schistosoma haematobium eggs in microscopy images of urine samples.

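A grid-based lesion proportion of the kind described in [15] reduces to counting squares. The precise CLP definition is not given in this section, so the sketch below assumes one plausible reading: the fraction of the 424 grid squares overlying the cervix that contain a lesion. Under this reading, zoom level matters because it changes which squares fall on the cervix, consistent with the reader disagreement noted earlier. Inputs are per-square flags; all names are hypothetical.

```python
from typing import Sequence

GRID_SQUARES = 424  # grid size reported for the CLP study [15]


def cervical_lesion_proportion(on_cervix: Sequence[bool], has_lesion: Sequence[bool]) -> float:
    """Assumed CLP: lesion-bearing squares on the cervix divided by all squares on the cervix."""
    if len(on_cervix) != GRID_SQUARES or len(has_lesion) != GRID_SQUARES:
        raise ValueError(f"expected {GRID_SQUARES} flags per grid")
    cervix_squares = sum(1 for c in on_cervix if c)
    if cervix_squares == 0:
        raise ValueError("no grid squares overlie the cervix")
    lesion_squares = sum(1 for c, les in zip(on_cervix, has_lesion) if c and les)
    return lesion_squares / cervix_squares
```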
In summary, our study emphasizes robust preprocessing and harmonization, including specular-reflection removal and region-of-interest cropping, which allow the model to extract diagnostic features even from low-quality images with variable illumination and may improve generalizability. However, we acknowledge that external validation has not yet been performed. Our model was developed and tested exclusively on images acquired with Bovie Colpo-Master colposcopes at two Zambian study sites, so we cannot make claims about generalizability to other colposcope devices, imaging conditions, or populations. External validation across different colposcope brands, clinical settings, and endemic regions will be essential. We also acknowledge that the dataset size and class imbalance represent substantive limitations of our study. With 340 subjects and only 92 FGS-positive cases, the dataset may not fully capture the heterogeneity of FGS presentations across different populations. However, the limited number of FGS-positive subjects reflects the rarity of clinically confirmed cases and the challenges of diagnosis and data collection in endemic regions. This constraint likely affects generalization and contributes to model variance. Hence, multi-center data collection efforts are needed to build larger, more diverse training cohorts. Such expansion would not only improve statistical power but also enhance model robustness across clinical settings, an essential requirement for scalable, real-world deployment.
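With 92 positives among 340 subjects, roughly 73% of images are FGS-negative, so an unweighted loss is dominated by negatives. This section does not state whether class reweighting was used; the sketch below shows one common mitigation, inverse-frequency weights computed as w_c = N / (K · n_c), the same formula scikit-learn uses for `class_weight="balanced"`. With these weights, each class contributes equal total weight to the loss.

```python
from collections import Counter
from typing import Dict, Sequence


def balanced_class_weights(labels: Sequence[int]) -> Dict[int, float]:
    """w_c = n_samples / (n_classes * n_c): rarer classes receive proportionally larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in sorted(counts)}


# Class counts taken from the text: 92 FGS-positive and 248 FGS-negative subjects.
weights = balanced_class_weights([1] * 92 + [0] * 248)
```

Reweighting changes the operating point the model is pushed toward but does not add information, so it is a complement to, not a substitute for, the larger cohorts called for above.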

The importance of high sensitivity to avoid missing FGS cases is well known, especially because untreated infections can have serious health effects, such as kidney damage and fibrosis of the bladder and ureters in advanced cases [2]. However, false positive results can also cause problems. A woman misdiagnosed as FGS-positive can face social stigma in many communities [10]. This stigma can lead to discrimination, relationship problems, and emotional distress for the women affected [10]. Fear of stigmatization has been reported to hinder women from disclosing FGS-associated symptoms [10]. Chronic FGS lesions can be prevented by regular treatment with praziquantel when started at an early age and continued throughout life. However, praziquantel is not always available in areas where FGS is common, and shortages have been reported in some places [55]. Administering praziquantel to people who do not actually have FGS can strain limited healthcare resources. Praziquantel resistance has not been documented in S. haematobium populations to date, so judicious use of the drug is recommended to preserve its efficacy. Thus, while it is important to catch as many true cases as possible, steps to improve diagnostic accuracy and confirm positive results can help reduce negative effects on women, healthcare systems, and drug supplies.
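The cost of false positives can be quantified with positive predictive value (PPV), the share of positive calls that are true FGS cases, which follows from Bayes' rule given sensitivity, specificity, and prevalence. The sketch below uses the prevalence of this cohort (92/340, about 27%) only as an example; screening populations may differ, and the sensitivity and specificity values are hypothetical, not results from this study.

```python
from typing import Tuple


def predictive_values(sensitivity: float, specificity: float, prevalence: float) -> Tuple[float, float]:
    """Return (PPV, NPV) from Bayes' rule; NaN when a denominator is zero."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv


prevalence = 92 / 340
low_spec_ppv, _ = predictive_values(0.90, 0.60, prevalence)
high_spec_ppv, _ = predictive_values(0.90, 0.80, prevalence)
```

At the same 90% sensitivity, raising specificity from 0.60 to 0.80 lifts the PPV from roughly 0.45 to roughly 0.63, i.e., fewer women per positive call would be exposed to stigma and unnecessary treatment.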

FGS lesions may become more severe with repeated exposure, potentially reducing the effectiveness of treatment [18]. In the future, an AI tool could be useful not only for diagnosis but also for monitoring treatment effects, especially in women with more severe lesions and in rural endemic areas where repeated exposure is common [1]. It may also be possible to develop a smartphone-based app that frontline healthcare workers could use in clinical settings to diagnose FGS. Moreover, integrating healthcare workers into AI-assisted diagnostic workflows, as demonstrated in recent studies [16], could enhance training and facilitate broader adoption of FGS diagnostic tools in resource-limited settings. Importantly, such an AI tool has the potential to support point-of-care diagnosis and monitoring using low-cost imaging devices, even in remote areas without access to expert clinicians or colposcopes. With appropriate validation and integration into existing strategies, AI-assisted diagnosis could help guide timely treatment with praziquantel, support follow-up after drug administration, and ultimately contribute to reducing the long-term burden of FGS in endemic communities.

Some limitations of our study warrant consideration. A significant limitation of our approach is that the algorithm focuses primarily on the cervix and transformation zone, excluding the vaginal walls and fornices. FGS is diagnosed by visual inspection of characteristic lesions on the cervix and vaginal walls [9]. Women with vaginal wall lesions alone may not be detected by our current model, because the vaginal walls are very difficult to visualize and photograph consistently during colposcopy [56]. The anterior and posterior surfaces of the vaginal wall can be inspected by rotating the speculum 90 degrees [57]; hence, a single image focused on the cervix cannot capture the vaginal walls, and multiple images are needed to include them in the analysis. Clinicians may easily miss the localized and faint signs of S. haematobium infection within the female lower genital tract [56]. Similarly, FGS may involve the upper reproductive tract, whose structures are not visualized at all during colposcopy [56]. These are known limitations in the field of FGS diagnosis. Future algorithm development should include additional images of the vagina and methods to systematically include and analyze vaginal wall regions. Additionally, future studies could use complementary diagnostic approaches such as pelvic imaging or histopathological assessment to capture FGS manifestations in less accessible areas of the reproductive tract and to establish more accurate ground truth. Our cohort consisted of sex workers and single mothers; it will be important to analyze whether findings from these subgroups are generalizable to the general population. Another limitation is that FGS-positive women with concurrent STIs were not excluded from the analyses, so we were unable to determine whether the presence of another infection affected the sensitivity or specificity of AID-FGS. For example, trichomoniasis has a weak association with FGS [58].

Conclusions

Our study of the ability of various deep learning models and handcrafted features to diagnose FGS from images of the cervix taken during colposcopy indicated that the ensembled model performed better than its individual counterparts and the handcrafted features. The use of AI/machine learning models to identify FGS from cervical photographs could significantly enhance early diagnosis and treatment, potentially reducing the morbidity associated with FGS and lowering the risk of HIV transmission in affected populations.

Supporting information

S1 Text. This document provides a comprehensive description of the model training setup. It also includes the distribution of severity scores in the test dataset. (DOCX)

Bibliography


1. UNAIDS. No more neglect: Female genital schistosomiasis and HIV. 2019.
2. WHO. Schistosomiasis [Internet]. [cited 2025 Oct 31]. Available from: https://www.who.int/news-room/fact-sheets/detail/schistosomiasis
3. Bustinduy AL, Randriansolo B, Sturt AS, Kayuni SA, Leutscher PDC, Webster BL, et al. An update on female and male genital schistosomiasis and a call to integrate efforts to escalate diagnosis, treatment and awareness in endemic and non-endemic settings: The time is now. In: Advances in Parasitology [Internet]. Elsevier; 2022. pp. 1–44. [cited 2023 Dec 1]. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0065308X21000592. doi: 10.1016/bs.apar.2021.12.003. PMID: 35249661.
4. Rafferty H, Sturt AS, Phiri CR, Webb EL, Mudenda M, Mapani J, et al. Association between cervical dysplasia and female genital schistosomiasis diagnosed by genital PCR in Zambian women. BMC Infect Dis. 2021;21(1):691. doi: 10.1186/s12879-021-06380-5. PMID: 34273957; PMCID: PMC8286581.
5. Kjetland EF, Ndhlovu PD, Mduluza T, Deschoolmeester V, Midzi N, Gomo E, et al. The effects of genital Schistosoma haematobium on human papillomavirus and the development of cervical neoplasia after five years in a Zimbabwean population. Eur J Gynaecol Oncol. 2010;31(2):169–73. PMID: 20527233.
6. Wall KM, Kilembe W, Vwalika B, Dinh C, Livingston P, Lee Y-M, et al. Schistosomiasis is associated with incident HIV transmission and death in Zambia. PLoS Negl Trop Dis. 2018;12(12):e0006902. doi: 10.1371/journal.pntd.0006902. PMID: 30543654; PMCID: PMC6292564.
7. Patel P, Rose CE, Kjetland EF, Downs JA, Mbabazi PS, Sabin K, et al. Association of schistosomiasis and HIV infections: a systematic review and meta-analysis. Int J Infect Dis. 2021;102:544–53. doi: 10.1016/j.ijid.2020.10.088. PMID: 33157296; PMCID: PMC8883428.
8. Jin E, Noble JA, Gomes M. A review of computer-aided diagnostic algorithms for cervical neoplasia and an assessment of their applicability to female genital schistosomiasis. Mayo Clin Proc Digit Health. 2023;1(3):247–57. doi: 10.1016/j.mcpdig.2023.04.007. PMID: 40206624; PMCID: PMC11975695.