# A population spatialization method based on the integration of feature selection and an improved random forest model

**Authors:** Zhen Zhao, Hongmei Guo, Xueli Jiang, Ying Zhang, Changjiang Lu, Can Zhang, Zonghang He

PMC · DOI: 10.1371/journal.pone.0321263 · PLOS One · 2025-04-03

## TL;DR

This paper introduces a new method for mapping population distribution by combining feature selection techniques with an improved random forest model to enhance accuracy.

## Contribution

The novelty lies in integrating feature selection with an improved random forest model to address unbalanced datasets in population spatialization.

## Key findings

- Among the three feature-selection variants (MIC-RF, RFECV-RF, MDA-RF), the MDA-RF model achieved the lowest MAPE (0.174) and the highest R² (0.913).
- The improved random forest model increased prediction accuracy by 1.7% compared to MDA-RF.
- The proposed method outperformed the WorldPop dataset with lower MRE and RMSE values.
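
The findings above are stated in terms of MAPE, R² and RMSE, but this summary does not give the formulas. The sketch below uses the common textbook definitions, which are an assumption here; the paper's exact definitions (for example, whether MRE is a percentage form of mean relative error) may differ. The sample values are hypothetical and only illustrate the calls.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction (not x100)."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the inputs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r2(actual, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical census counts and model predictions, for illustration only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

scores = {
    "MAPE": mape(actual, predicted),
    "RMSE": rmse(actual, predicted),
    "R2": r2(actual, predicted),
}
```

Lower MAPE and RMSE and higher R² indicate a better fit, which is the sense in which the models are ranked above.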

## Abstract

Ascertaining the precise and accurate spatial distribution of population is essential in conducting effective urban planning, resource allocation, and emergency rescue planning. The random forest (RF) model is widely used in population spatialization studies. However, the complexity of population distribution characteristics and the limitations of the RF model in processing unbalanced datasets affect population prediction accuracy. To address these issues, a population spatialization model that integrates feature selection with an improved random forest is proposed herein. Firstly, recursive feature elimination with cross-validation (RFECV), maximum information coefficient (MIC), and mean decrease accuracy (MDA) methods were used to select population distribution feature factors. Random forests were constructed from the feature subsets selected by each method, giving MIC-RF, RFECV-RF and MDA-RF. Subsequently, the feature factors corresponding to the model with the highest accuracy were selected as the optimal feature subset and used as input data for model construction. Additionally, considering the imbalance in population spatial distribution, we used the K-means++ clustering algorithm to cluster the optimal feature subset, and we used the bootstrap sampling method to extract the same amount of data from each cluster and fuse it with the training subset to build an improved random forest model. Based on this model, a spatial population distribution dataset of the Southern Sichuan Economic Zone at a 500 m resolution was generated. Finally, the population dataset generated in this study was compared and validated with the WorldPop dataset. The results showed that utilizing feature selection methods improves model accuracy to varying degrees compared with RF based on all factors, and the MDA-RF had the lowest MAPE of 0.174 and the highest R² of 0.913 among them. Therefore, the feature factors selected by the MDA method were taken as the optimal feature subset. Compared with MDA-RF, the prediction accuracy of the improved RF built on the same subset increased by 1.7%, indicating that improving the bootstrap sampling of random forest by using the K-means++ clustering algorithm can enhance model accuracy to some extent. Compared with the WorldPop dataset, the accuracy of the results predicted using the proposed method was enhanced. The MRE and RMSE of the WorldPop dataset were 57.24 and 23174.98, respectively, while the MRE and RMSE of the proposed method were 25.00 and 15776.50, respectively. This implies that the method proposed in this paper could simulate population spatial distribution more accurately.
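
The sampling idea in the abstract (cluster the data, then draw the same number of rows from every cluster and fuse them with the training subset) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: cluster labels are assumed to be computed beforehand by some K-means++ implementation, the function name `cluster_balanced_bootstrap` and its parameters are hypothetical, and "fusing" is interpreted here as simple concatenation with an ordinary bootstrap sample, which the summary does not confirm.

```python
import random
from collections import defaultdict


def cluster_balanced_bootstrap(labels, n_per_cluster, seed=0):
    """Return row indices: an ordinary bootstrap sample followed by
    n_per_cluster rows drawn with replacement from every cluster.

    Assumption: `labels[i]` is the precomputed cluster id of row i.
    """
    rng = random.Random(seed)
    n = len(labels)

    # Ordinary bootstrap: n rows drawn with replacement from all rows.
    base = [rng.randrange(n) for _ in range(n)]

    # Group row indices by cluster label.
    by_cluster = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_cluster[lab].append(idx)

    # Draw the same number of rows from each cluster, with replacement,
    # so that rare clusters are not drowned out by common ones.
    extra = []
    for lab in sorted(by_cluster):
        extra.extend(rng.choices(by_cluster[lab], k=n_per_cluster))

    # Assumed fusion step: concatenate the two index lists.
    return base + extra


# Hypothetical labels: cluster 0 is common, cluster 1 is rare.
labels = [0] * 8 + [1] * 2
sample = cluster_balanced_bootstrap(labels, n_per_cluster=3)
```

In a forest, each tree would be fit on one such index set drawn with a different seed, so every tree sees at least `n_per_cluster` rows from the rare clusters.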

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11968112/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11968112/full.md

## References

56 references — full list in the complete paper: https://tomesphere.com/paper/PMC11968112/full.md

---
Source: https://tomesphere.com/paper/PMC11968112