# Accurate and interpretable prediction of chemical oxygen demand using explainable boosting algorithms with SHAP analysis

**Authors:** Khaled Merabet, Sungwon Kim, Salim Heddam, Fabio Di Nunno, Francesco Granata, Ozgur Kisi, Rana Muhammad Adnan, Mohammad Zounemat-Kermani, Christoph Külls

PMC · DOI: 10.1038/s41598-026-38757-4 · Scientific Reports · 2026-02-13

## TL;DR

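This study compares six boosting models for predicting chemical oxygen demand (COD) at two monitoring stations in South Korea. NGBoost was the most accurate at Toilchun and CatBoost at Hwangji; NGBoost also returns predictive distributions that quantify uncertainty, and SHAP analysis is used to interpret the models.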

## Contribution

The main contribution is applying NGBoost to COD prediction: it returns a predictive probability distribution rather than a single estimate, so each prediction carries a measure of uncertainty.

## Key findings

- NGBoost achieved the highest predictive accuracy at Toilchun (R = 0.979, NSE = 0.958, RMSE = 0.397 mg/L).
- CatBoost performed best at Hwangji (R = 0.861, NSE = 0.733, RMSE = 0.477 mg/L).
- SHAP analysis identified TOC, BOD₅, and SS as the most influential variables for COD prediction.
- NGBoost's probabilistic outputs allow for better quantification of COD variability and model uncertainty.
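
To illustrate what a probabilistic output adds over a point estimate, the sketch below turns a predicted mean and standard deviation into a prediction interval. It assumes a Gaussian predictive distribution, which this summary does not confirm for the study; the function name `prediction_interval` and the example numbers are hypothetical and are not results from the paper.

```python
from statistics import NormalDist


def prediction_interval(mean, std, coverage=0.95):
    """Return (lower, upper) bounds of a central interval.

    Assumption: the predictive distribution is Gaussian with the
    given mean and standard deviation (both in mg/L for COD).
    """
    if std <= 0:
        raise ValueError("std must be positive")
    if not 0 < coverage < 1:
        raise ValueError("coverage must be between 0 and 1")
    dist = NormalDist(mu=mean, sigma=std)
    tail = (1 - coverage) / 2  # probability mass left in each tail
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)


# Hypothetical prediction: mean COD of 5.0 mg/L with std of 0.4 mg/L.
lower, upper = prediction_interval(5.0, 0.4)
print(f"95% interval: {lower:.2f} to {upper:.2f} mg/L")
```

A wider interval signals that the model is less certain about that sample, which is information a single-valued prediction cannot convey.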

## Abstract

Accurate prediction of Chemical Oxygen Demand (COD) is vital for effective water quality management and pollution control. This study compares six ensemble boosting models, AdaBoost, CatBoost, XGBoost, LightGBM, HistGBRT, and NGBoost, for estimating COD from multiple water quality parameters, including pH, dissolved oxygen, suspended solids, and specific conductance. Data from two monitoring stations in South Korea (Toilchun and Hwangji) were used to train and validate the models. Model performance was evaluated using RMSE, MAE, R, NSE, and PBIAS, while interpretability was assessed through SHapley Additive exPlanations (SHAP). Results showed that NGBoost achieved the highest predictive accuracy at Toilchun (R = 0.979, NSE = 0.958, RMSE = 0.397 mg/L), while CatBoost performed best at Hwangji (R = 0.861, NSE = 0.733, RMSE = 0.477 mg/L). As NGBoost provides predictive probability distributions rather than single estimates, its results also reflect model uncertainty, supporting a more robust quantification of COD variability. SHAP analysis identified total organic carbon (TOC), biochemical oxygen demand (BOD₅), and suspended solids (SS) as the most influential variables controlling COD dynamics.

The online version contains supplementary material available at 10.1038/s41598-026-38757-4.
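
The five evaluation metrics named in the abstract can be computed as in the following minimal sketch. It uses the standard textbook definitions; the PBIAS sign convention here (positive means the model underestimates on average) is an assumption, since conventions differ and this summary does not state which one the paper uses. The function name `evaluate` is hypothetical.

```python
import math


def evaluate(observed, predicted):
    """Return RMSE, MAE, Pearson R, NSE, and PBIAS for paired series."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")

    errors = [p - o for o, p in zip(observed, predicted)]
    sq_err = sum(e * e for e in errors)

    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    cov = sum((o - mean_o) * (p - mean_p)
              for o, p in zip(observed, predicted))

    return {
        "RMSE": math.sqrt(sq_err / n),
        "MAE": sum(abs(e) for e in errors) / n,
        # Pearson correlation coefficient between observed and predicted.
        "R": cov / math.sqrt(var_o * var_p),
        # Nash-Sutcliffe efficiency: 1 is perfect, 0 matches the mean.
        "NSE": 1 - sq_err / var_o,
        # Assumed convention: 100 * sum(obs - pred) / sum(obs).
        "PBIAS": 100 * sum(o - p for o, p in zip(observed, predicted))
                 / sum(observed),
    }


# Hypothetical COD values in mg/L, for illustration only.
print(evaluate([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]))
```

In this toy case every prediction is 1 mg/L too high, so R is still 1 while NSE drops to 0.2 and PBIAS is -40%, which shows why correlation alone is not enough to judge accuracy.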

## Full-text entities

- **Chemicals:** Oxygen (MESH:D010100), water (MESH:D014867), BOD5 (-)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12905304/full.md

## Figures

13 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12905304/full.md

## References

3 references — full list in the complete paper: https://tomesphere.com/paper/PMC12905304/full.md

---
Source: https://tomesphere.com/paper/PMC12905304