# Explainable machine-learning-based predictions of blood lead levels and school drinking water contamination among children: a case study in Washington DC

**Authors:** Dylan Darling, Yogesh Bhattarai, Sara Kamanmalek, Rocky Talchabhadel, Sanjib Sharma

PMC · DOI: 10.1038/s41598-025-24213-2 · 2025-11-18

## TL;DR

This study uses explainable machine learning to predict blood lead levels in children and lead contamination of school drinking water in Washington, DC, and identifies high-risk areas and the factors that drive the risk.

## Contribution

The study applies explainable machine learning models to predict and explain lead contamination risks in an urban water system.

## Key findings

- Machine learning models achieved strong predictive performance with AUC between 0.90 and 0.95.
- High-risk zones were identified, particularly in Wards 1, 4, and 6, with lead pipe density and social vulnerability as key predictors.
- Ensemble models outperformed logistic regression in accuracy, precision, and recall.

## Abstract

Water quality degradation poses significant risks to human health, ecosystems, and communities. Many cities continue to rely on outdated pipes and water distribution networks that are highly susceptible to leaks, corrosion, and lead contamination. The processes driving lead contamination are evolving with aging infrastructure and a changing environment, and predicting the associated risk remains a critical challenge. The key objective of this study is to improve the understanding and prediction of blood lead levels and school drinking water contamination among children using explainable machine learning. Focusing on Washington, District of Columbia, where lead exposure remains a persistent concern, we develop and evaluate random forest, adaptive boosting, and gradient boosting models using environmental, topographic, socioeconomic, and infrastructure features as predictive inputs. We then apply Shapley additive explanations to quantify the relative influence of each variable on model outcomes. Results demonstrate strong discriminative ability across all models, with area under the receiver operating characteristic curve ranging from 0.90 to 0.95. Ensemble-based approaches consistently outperform logistic regression, achieving higher accuracy, precision, recall, and F1-scores, along with narrower confidence intervals. Over 11% of the city lies in a very high-risk zone, and 13% is classified as a high-risk zone. In particular, Wards 1, 4, and 6 are among the most impacted areas, exhibiting high concentrations of lead service lines and elevated predicted contamination risk. City-wide predictions are primarily driven by lead pipe density and social vulnerability, while school-level risks are more strongly influenced by water infrastructure characteristics, including device type and building age.
These findings offer critical insights for guiding targeted interventions such as lead service line replacements, prioritization of high-risk schools, and resource allocation to vulnerable neighborhoods.
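
The idea behind Shapley additive explanations can be sketched on a toy example. The sketch below computes exact Shapley values by enumerating feature coalitions against a single baseline. It is not the `shap` library and not the study's fitted model: the function `toy_risk`, its coefficients, and the feature values are hypothetical, chosen only to show how a prediction is split into per-feature contributions.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Mapping


def shapley_values(
    model: Callable[[Mapping[str, float]], float],
    instance: Mapping[str, float],
    baseline: Mapping[str, float],
) -> dict[str, float]:
    """Exact Shapley values of `instance` relative to `baseline` (exponential in feature count)."""
    features = list(instance)
    n = len(features)

    def value(coalition: frozenset) -> float:
        # Features in the coalition take the instance's value; the rest stay at baseline.
        mixed = {f: (instance[f] if f in coalition else baseline[f]) for f in features}
        return model(mixed)

    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):  # k = size of the coalition that f joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_risk(x: Mapping[str, float]) -> float:
    """Hypothetical risk score with an interaction term; NOT the paper's model."""
    return (
        0.5 * x["lead_pipe_density"]
        + 0.3 * x["social_vulnerability"]
        + 0.2 * x["lead_pipe_density"] * x["social_vulnerability"]
        + 0.1 * x["building_age"]
    )


# Illustrative, made-up feature values on an arbitrary normalized scale.
instance = {"lead_pipe_density": 1.0, "social_vulnerability": 1.0, "building_age": 0.5}
baseline = {"lead_pipe_density": 0.0, "social_vulnerability": 0.0, "building_age": 0.0}

phi = shapley_values(toy_risk, instance, baseline)
print(phi)
```

The contributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is shared equally between the two features involved. Practical implementations approximate this computation, since exact enumeration grows exponentially with the number of features.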

## Full-text entities

- **Chemicals:** lead (MESH:D007854)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12627720/full.md

---
Source: https://tomesphere.com/paper/PMC12627720