# Performance of an Automated Algorithm Grading Surgery-Related Adverse Events According to the Clavien-Dindo Classification: A Systematic Review

**Authors:** Mohammed Abutalib Elmobark Gafar, Shashwat Shetty, Muhammad Qaiser Aziz, Eithar Diaeldin Awad Abdelrahman, Mohammed Muatasim Abbas Eltoom, Bilal Ahmad

PMC · DOI: 10.7759/cureus.100960 · Cureus · 2026-01-06

## TL;DR

This systematic review of three studies (1,661 surgical cases) finds that automated tools for grading surgical complications with the Clavien-Dindo Classification (CDC) agree closely with expert reviewers, although the evidence base is small and heterogeneous.

## Contribution

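Following PRISMA 2020 guidelines, the study systematically evaluates how well rule-based, machine learning (ML), and large language model (LLM)/natural language processing (NLP) algorithms grade surgical complications against human-validated Clavien-Dindo Classification (CDC) grades.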

## Key findings

- Rule-based algorithms achieved a Cohen’s κ of up to 0.89 when compared to expert reviewers.
- LLM/NLP approaches reached an accuracy of approximately 97% and a Cohen’s κ of up to 0.92.
- ML prediction models reported an AUC of up to 0.863 for predicting severe complications (CDC grade ≥ III).
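
The agreement figures above are Cohen’s κ values, which correct raw agreement for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). A minimal sketch of that arithmetic follows. The grade lists are made-up toy data, not results from the reviewed studies, and the function name `cohens_kappa` and the use of `"0"` for "no complication" are assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length sequences of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(rater_a)
    # Observed agreement: share of cases where both raters assign the same grade.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal frequency, per grade.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[g] * freq_b[g] for g in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical grade")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data: CDC grades per case ("0" = no complication, an assumption).
expert = ["I", "I", "II", "II", "II", "IIIa", "IIIb", "0"]
algorithm = ["I", "I", "II", "II", "I", "IIIa", "IIIb", "0"]

accuracy = sum(e == a for e, a in zip(expert, algorithm)) / len(expert)
kappa = cohens_kappa(expert, algorithm)
print(f"accuracy={accuracy:.3f}, kappa={kappa:.3f}")
```

On this toy data, seven of eight grades match, so raw accuracy is 0.875 while κ is lower (41/49, about 0.84) because some of that agreement would occur by chance. This sketch treats all disagreements equally; a weighted κ, which penalizes distant grades more heavily, would be a different calculation.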

## Abstract

Postoperative adverse events (AEs) significantly impact patient outcomes and healthcare resources. The Clavien-Dindo Classification (CDC) is widely used to grade surgical complications, but manual grading is labor-intensive and subject to inter-observer variability. Automated algorithms, including rule-based, machine learning (ML), and large language model (LLM)-based natural language processing (NLP) tools, offer scalable solutions for consistent complication grading. A systematic review was conducted following PRISMA 2020 guidelines. The databases searched were PubMed, Embase, Scopus, and the Cochrane Library. Studies that reported automated grading of surgery-related AEs using the CDC as a reference, with human validation, were included. Data extraction covered algorithm type, sample size, surgical population, comparator, data source, performance metrics, and outcomes. Three studies met the inclusion criteria, encompassing a total of 1,661 surgical cases. Automated CDC grading algorithms showed high agreement with expert reviewers: rule-based algorithms achieved a Cohen’s κ of up to 0.89, and LLM/NLP approaches reached an accuracy of approximately 97% and a Cohen’s κ of up to 0.92. ML prediction models reported discrimination of up to an AUC of 0.863 for severe (CDC ≥ III) complications. Together, these methods show potential for scalable and, in some settings, near-real-time postoperative complication monitoring. These tools may support clinical decision-making, research, and quality improvement, with promising but preliminary applicability across surgical domains. However, conclusions are limited by the small number of available studies and by heterogeneity in surgical settings.
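
The AUC reported for severe (CDC ≥ III) complications measures how well a model's risk scores rank severe cases above non-severe ones. As a rough sketch under stated assumptions, the example below binarizes CDC grades at grade III and computes AUC as the share of (severe, non-severe) pairs ranked correctly, counting ties as half. The grades, risk scores, and helper names are hypothetical illustrations, not data or methods from the reviewed studies.

```python
# CDC grades treated as severe (grade III or higher).
SEVERE_GRADES = {"IIIa", "IIIb", "IVa", "IVb", "V"}


def is_severe(grade):
    """True when a CDC grade is III or higher."""
    return grade in SEVERE_GRADES


def pairwise_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    # A pair scores 1 if the severe case has the higher risk, 0.5 on a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: reference grades ("0" = no complication, an assumption)
# and made-up predicted risks of a severe complication.
grades = ["I", "II", "IIIa", "0", "IVa", "II", "IIIb", "I"]
risk = [0.10, 0.40, 0.80, 0.05, 0.90, 0.70, 0.15, 0.20]

labels = [is_severe(g) for g in grades]
auc_value = pairwise_auc(labels, risk)
print(f"AUC={auc_value:.3f}")
```

Here three cases are severe and five are not, giving 15 pairs, of which 12 are ranked correctly (AUC 0.8). Note that AUC describes discrimination of a risk score, which is a different question from the grade-by-grade agreement that Cohen’s κ captures.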

## Full-text entities

- **Diseases:** postoperative complication (MESH:D011183)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12876034/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12876034/full.md

## References

16 references — full list in the complete paper: https://tomesphere.com/paper/PMC12876034/full.md

---
Source: https://tomesphere.com/paper/PMC12876034