# An Unsupervised Error Detection Methodology for Detecting Mislabels in Healthcare Analytics

**Authors:** Pei-Yuan Zhou, Faith Lum, Tony Jiecao Wang, Anubhav Bhatti, Surajsinh Parmar, Chen Dan, Andrew K. C. Wong

PMC · DOI: 10.3390/bioengineering11080770 · Bioengineering · 2024-07-31

## TL;DR

This paper introduces a new unsupervised method to detect errors in healthcare data, improving clustering and classification accuracy for sepsis risk assessment.

## Contribution

A novel unsupervised error detection method that uses the Pattern Discovery and Disentanglement (PDD) model to identify mislabeled samples in healthcare datasets.

## Key findings

- The method outperformed K-Means by 38% on the full dataset and 47% on the reduced dataset for clustering.
- Supervised classifiers improved accuracy by an average of 4% after removing detected abnormal samples.
- The approach generates an interpretable knowledge base for better decision-making in clinical settings.

## Abstract

Medical datasets may be imbalanced and contain errors due to subjective test results and clinical variability. Poor-quality source data reduces classification accuracy and reliability, so detecting abnormal samples in a dataset can help clinicians make better decisions. In this study, we propose an unsupervised error detection method that uses patterns discovered by the Pattern Discovery and Disentanglement (PDD) model, developed in our earlier work. Applied to a large dataset, the eICU Collaborative Research Database, for sepsis risk assessment, the proposed algorithm can effectively discover statistically significant association patterns, generate an interpretable knowledge base, cluster samples in an unsupervised manner, and detect abnormal samples in the dataset. In our experiments, the method outperformed K-Means by 38% on the full dataset and 47% on the reduced dataset for unsupervised clustering. Multiple supervised classifiers improved accuracy by an average of 4% after the abnormal samples flagged by the proposed error detection approach were removed. The proposed algorithm therefore provides a robust and practical solution for unsupervised clustering and error detection in healthcare data.
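
The general idea of using unsupervised clusters to flag possibly mislabeled samples can be illustrated with a minimal sketch. This is not the PDD algorithm from the paper, whose details are not given in this summary. It swaps in a much simpler stand-in rule: a sample is flagged when its recorded label disagrees with the majority label of its cluster. The function names `flag_suspect_labels` and `drop_indices`, the `min_purity` threshold, and the toy cluster assignments and labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def flag_suspect_labels(cluster_ids, labels, min_purity=0.5):
    """Return sorted indices whose label disagrees with their cluster's majority.

    Stand-in heuristic (NOT the paper's PDD method): clusters whose majority
    label covers at most `min_purity` of their members are skipped, because
    they give no clear signal about which label is the expected one.
    """
    members_by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members_by_cluster[cid].append(idx)

    suspects = []
    for members in members_by_cluster.values():
        counts = Counter(labels[i] for i in members)
        majority_label, majority_count = counts.most_common(1)[0]
        if majority_count / len(members) <= min_purity:
            continue  # no clear majority in this cluster
        suspects.extend(i for i in members if labels[i] != majority_label)
    return sorted(suspects)


def drop_indices(rows, indices):
    """Return a copy of `rows` without the flagged positions."""
    skip = set(indices)
    return [row for i, row in enumerate(rows) if i not in skip]


# Hypothetical toy data: two clusters of four samples each.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]
labels = ["sepsis", "sepsis", "sepsis", "no_sepsis",
          "no_sepsis", "no_sepsis", "sepsis", "no_sepsis"]

suspects = flag_suspect_labels(cluster_ids, labels)
cleaned_labels = drop_indices(labels, suspects)
print(suspects)             # [3, 6]
print(len(cleaned_labels))  # 6
```

In a workflow like the one the abstract describes, the cleaned set would then be passed to a supervised classifier and its accuracy compared with training on the unfiltered data. The cluster assignments here could come from any unsupervised method.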

## Full-text entities

- **Diseases:** sepsis (MESH:D018805)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11351123/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11351123/full.md

## References

34 references — full list in the complete paper: https://tomesphere.com/paper/PMC11351123/full.md

---
Source: https://tomesphere.com/paper/PMC11351123