# Obtaining the Most Accurate, Explainable Model for Predicting Chronic Obstructive Pulmonary Disease: Triangulation of Multiple Linear Regression and Machine Learning Methods

**Authors:** Arnold Kamis, Nidhi Gadia, Zilin Luo, Shu Xin Ng, Mansi Thumbar

PMC · DOI: 10.2196/58455 · JMIR AI · 2024-08-29

## TL;DR

This study compares linear and machine learning models for predicting COPD rates at the core-based statistical area (CBSA) level in the United States. A gradient boosted tree model is the most accurate, and cigarette smoking and household income are the strongest predictors.

## Contribution

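By comparing multiple linear regression with machine learning methods on integrated public data, the study identifies a gradient boosted tree as the most accurate model for predicting COPD rates at the CBSA level while keeping the model interpretable.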

## Key findings

- The most accurate machine learning model explained 85.7% of the variance in COPD rates, compared with 81.1% for the most accurate multiple linear regression model.
- Cigarette smoking and household income were the strongest predictors of COPD.
- Gradient boosted trees outperformed linear models in accuracy and captured nonlinear relationships.

## Abstract

Lung disease is a severe problem in the United States. Despite the decreasing rates of cigarette smoking, chronic obstructive pulmonary disease (COPD) continues to be a health burden in the United States. In this paper, we focus on COPD in the United States from 2016 to 2019.

We gathered a diverse set of non–personally identifiable information from public data sources to better understand and predict COPD rates at the core-based statistical area (CBSA) level in the United States. Our objective was to compare linear models with machine learning models to obtain the most accurate and interpretable model of COPD.

We integrated non–personally identifiable information from multiple Centers for Disease Control and Prevention sources and used them to analyze COPD with different types of methods. We included cigarette smoking, a well-known contributing factor, and race/ethnicity because health disparities among different races and ethnicities in the United States are also well known. The models also included the air quality index, education, employment, and economic variables. We fitted models with both multiple linear regression and machine learning methods.

The most accurate multiple linear regression model explains 81.1% of the variance, with a mean absolute error of 0.591 and a symmetric mean absolute percentage error of 9.666. The most accurate machine learning model explains 85.7% of the variance, with a mean absolute error of 0.456 and a symmetric mean absolute percentage error of 6.956. Overall, cigarette smoking and household income are the strongest predictor variables. Moderately strong predictors include education level and unemployment level, as well as American Indian or Alaska Native, Black, and Hispanic population percentages, all measured at the CBSA level.
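The three accuracy measures reported above can be illustrated with a minimal Python sketch. It rests on assumptions: "variance explained" is taken to be the ordinary coefficient of determination (R²), and the symmetric mean absolute percentage error uses one common definition (absolute error divided by the mean of the absolute actual and predicted values, expressed as a percentage). The abstract does not state which variants the authors used, and the sample values below are made up for illustration, not taken from the study.

```python
from typing import Sequence


def variance_explained(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """R^2 = 1 - SS_res / SS_tot, as a fraction (0.857 corresponds to 85.7%)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference, in the same units as the outcome."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def smape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Symmetric MAPE in percent (assumed variant; other definitions exist)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2.0
        # Treat a 0/0 term as zero error rather than dividing by zero.
        total += 0.0 if denom == 0 else abs(t - p) / denom
    return 100.0 * total / len(y_true)


# Hypothetical COPD rates (percent) for three areas; illustrative values only.
actual = [6.0, 8.0, 10.0]
predicted = [6.5, 7.5, 10.0]

r2 = variance_explained(actual, predicted)
mae = mean_absolute_error(actual, predicted)
smape_pct = smape(actual, predicted)
print(f"R^2={r2:.4f}  MAE={mae:.4f}  sMAPE={smape_pct:.3f}%")
```

Because the mean absolute error is in the units of the outcome while the symmetric percentage error is scale-free, reporting both lets readers judge absolute and relative accuracy separately.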

This research highlights the importance of using diverse data sources as well as multiple methods to understand and predict COPD. The most accurate model was a gradient boosted tree, which captured nonlinearities in a model whose accuracy is superior to the best multiple linear regression. Our interpretable models suggest ways that individual predictor variables can be used in tailored interventions aimed at decreasing COPD rates in specific demographic and ethnographic communities. Gaps in understanding the health impacts of poor air quality, particularly in relation to climate change, suggest a need for further research to design interventions and improve public health.
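The claim that a gradient boosted tree can capture nonlinearities that a linear model misses can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the data are synthetic (a single hypothetical predictor with a threshold effect), the booster is a hand-written ensemble of one-split trees fitted to squared-error residuals, and the function names are invented. It is not the authors' model, data, or software.

```python
import random


def r_squared(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns a predict function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_stump(xs, residuals):
    """Find the single split that minimizes squared error on the residuals."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    n, total, left_sum, best = len(xs), sum(residuals), 0.0, None
    for k in range(1, n):
        i = order[k - 1]
        left_sum += residuals[i]
        if xs[order[k]] == xs[i]:
            continue  # cannot split between identical x values
        right_sum = total - left_sum
        # Maximizing this score is equivalent to minimizing squared error.
        score = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if best is None or score > best[0]:
            threshold = (xs[i] + xs[order[k]]) / 2.0
            best = (score, threshold, left_sum / k, right_sum / (n - k))
    return best[1], best[2], best[3]


def fit_boosted_stumps(xs, ys, n_rounds=200, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump fits current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, left, right = fit_stump(xs, residuals)
        stumps.append((thr, left, right))
        preds = [p + learning_rate * (left if x <= thr else right) for x, p in zip(xs, preds)]

    def predict(x):
        return base + learning_rate * sum(l if x <= t else r for t, l, r in stumps)

    return predict


def make_data(n, rng):
    """Synthetic threshold effect: flat below x = 5, rising steeply above it."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [4.0 + 0.3 * max(0.0, x - 5.0) ** 2 + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(100, rng)

linear = fit_line(train_x, train_y)
boosted = fit_boosted_stumps(train_x, train_y)

r2_linear = r_squared(test_y, [linear(x) for x in test_x])
r2_boosted = r_squared(test_y, [boosted(x) for x in test_x])
print(f"held-out R^2  linear={r2_linear:.3f}  boosted={r2_boosted:.3f}")
```

On data with a threshold like this, the straight line has to average over the flat and steep regions, while the boosted stumps approximate the curve piecewise. With real, multi-predictor data the gap may be much smaller, as the 81.1% versus 85.7% figures above suggest.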

## Linked entities

- **Diseases:** chronic obstructive pulmonary disease (MONDO:0005002), COPD (MONDO:0005002)

## Full-text entities

- **Diseases:** COPD (MESH:D029424), Lung disease (MESH:D008171)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11393512/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11393512/full.md

## References

111 references — full list in the complete paper: https://tomesphere.com/paper/PMC11393512/full.md

---
Source: https://tomesphere.com/paper/PMC11393512