# Evaluating the impact of sex bias on AI models in musculoskeletal ultrasound of joint recess distension

**Authors:** M. Mendez, N. Jafarpisheh, S. Demello, C. Lee, M. Dang, P. N. Tyrrell, Citrawati Wungu

PMC · DOI: 10.1371/journal.pone.0332716 · PLOS One · 2025-11-12

## TL;DR

This study examines how sex bias in training data affects AI models for diagnosing knee joint recess distension in ultrasound images, finding that models trained on sex-balanced datasets generalize better across sexes.

## Contribution

The study introduces a novel evaluation of sex bias in AI models for musculoskeletal ultrasound, emphasizing the importance of balanced training data for equitable healthcare outcomes.

## Key findings

- Models trained exclusively on female images achieved higher sensitivity and accuracy but lower specificity when applied to male images, suggesting overfitting to female-specific features.
- Balanced training datasets improved generalizability and reduced sex-based performance disparities.
- Classification heatmaps from balanced models aligned more closely with clinically relevant features across sexes.
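The sex-based disparities in the findings above are stated in terms of sensitivity and specificity. As a minimal sketch of how such per-sex metrics could be computed from a classifier's predictions, the snippet below uses toy labels; the function name `sensitivity_specificity_by_group` and all data values are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def sensitivity_specificity_by_group(y_true, y_pred, groups):
    """Return {group: (sensitivity, specificity)} for binary labels.

    Label 1 is assumed to mean "distension present", 0 "absent".
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            key = "tp" if pred == 1 else "fn"
        else:
            key = "tn" if pred == 0 else "fp"
        counts[group][key] += 1

    results = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["tn"] + c["fp"]
        # Guard against groups that lack positive or negative cases.
        sens = c["tp"] / positives if positives else float("nan")
        spec = c["tn"] / negatives if negatives else float("nan")
        results[group] = (sens, spec)
    return results


# Hypothetical toy predictions (not data from the study).
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 1, 0]
sex = ["F", "F", "F", "F", "M", "M", "M", "M"]

metrics = sensitivity_specificity_by_group(y_true, y_pred, sex)
for group, (sens, spec) in sorted(metrics.items()):
    print(f"{group}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this toy case the male subgroup has one false positive, so its specificity drops while sensitivity is unchanged, which is the kind of gap a cross-sex evaluation is meant to surface.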

## Abstract

With the increasing integration of artificial intelligence (AI) in healthcare, concerns about bias in AI models have emerged, particularly regarding demographic factors. In medical imaging, biases in training datasets can significantly impact diagnostic accuracy, leading to unequal healthcare outcomes. This study assessed the impact of sex bias on AI models for diagnosing knee joint recess distension using ultrasound imaging. We utilized a retrospective dataset from community clinics across Canada, comprising 5,000 de-identified musculoskeletal ultrasound (MSKUS) images categorized by sex and clinical findings. Two binary convolutional neural network (BCNN) classifiers were developed to detect synovial recess distension and determine patient sex. The dataset was balanced across sex and joint recess distension, with models trained using advanced data augmentation and validated through both individual and mixed demographic scenarios using a 5-fold cross-validation strategy. Our BCNN classifiers showed that AI performance varied significantly based on the training data’s demographic characteristics. Models trained exclusively on female datasets achieved higher sensitivity and accuracy but exhibited decreased specificity when applied to male images, suggesting a tendency to overfit female-specific features. Conversely, classifiers trained on balanced datasets displayed enhanced generalizability. This was evident from the classification heatmaps, which varied less between sexes, aligning more closely with clinically relevant features. The study highlights the critical influence of demographic biases on the diagnostic accuracy of AI models in medical imaging. Our results demonstrate the necessity for thorough cross-demographic validation and training on diverse datasets to mitigate biases.
These findings are based on a supervised CNN model; evaluating whether they extend to other architectures, such as self-supervised learning (SSL) methods, foundation models, and Vision Transformers (ViTs), remains a direction for future research.
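The abstract describes a dataset balanced across sex and distension status and evaluated with 5-fold cross-validation. One way such a split could be set up is to stratify folds on the joint (sex, label) combination so that each fold keeps the same balance. The sketch below is an assumption about how this might be done, not the authors' pipeline; `stratified_folds` and the toy metadata are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(strata, n_folds=5, seed=0):
    """Assign sample indices to folds so each stratum is spread evenly."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for index, stratum in enumerate(strata):
        by_stratum[stratum].append(index)

    folds = [[] for _ in range(n_folds)]
    for stratum in sorted(by_stratum):
        indices = by_stratum[stratum]
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Hypothetical toy metadata: 10 images per (sex, distension label) pair.
samples = [(sex, label) for sex in ("F", "M") for label in (0, 1) for _ in range(10)]
folds = stratified_folds(samples, n_folds=5, seed=42)

for k, test_idx in enumerate(folds):
    # Training set for fold k is every index outside the held-out fold.
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    composition = Counter(samples[i] for i in test_idx)
    print(f"fold {k}: train={len(train_idx)}, test={len(test_idx)}, {dict(composition)}")
```

Restricting `train_idx` to a single sex while keeping the held-out fold mixed would give a single-sex training scenario of the kind the abstract contrasts with balanced training.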

## Linked entities

- **Species:** Homo sapiens (taxon 9606)

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12611148/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12611148/full.md

## References

34 references — full list in the complete paper: https://tomesphere.com/paper/PMC12611148/full.md

---
Source: https://tomesphere.com/paper/PMC12611148