# Assessing imputation techniques for missing data in small and multicollinear datasets: insights from craniofacial morphometry

**Authors:** Norli Anida Abdullah, Firdaus Hariri, Mohamad Norikmal Fazli Hisam, Siti Fatimah Binti Hassan

PMC · DOI: 10.1186/s12874-025-02762-4 · 2026-02-04

## TL;DR

This study compares methods for filling in missing data in small, complex craniofacial datasets and finds that random forest imputation works best.

## Contribution

Identifies random forest as the most effective imputation method for small, high-dimensional, and correlated craniofacial datasets.

## Key findings

- Random Forest (RF) imputation had the lowest RMSE and MAE, showing high accuracy in filling missing data.
- RF kept the dataset's variability reasonably close to the original, with a variance preservation score of 0.8961 (a score of 1 means the original variance is fully retained).
- MICE had the variance preservation score closest to 1 (1.0580) but was less accurate than RF (RMSE 3.0869, MAE 1.1246).

## Abstract

Analyses of craniofacial morphology are essential for various medical and research applications, including the study of midfacial development, dysmorphologies, and planning surgical interventions. CT scans are often incomplete because of patient movement, imaging artifacts, or obscured landmarks, which can result in missing data. If not properly addressed, such missingness may bias conclusions and weaken statistical power.

This paper evaluates imputation techniques to identify the most suitable method for handling values that are missing completely at random (MCAR) in small, high-dimensional, and highly correlated craniofacial morphometric datasets.

Forty-two craniofacial variables were measured from 32 observations. The missing data structure was set to be at random, with 268 (20%) missing values. Five common imputation techniques were considered: Mean/Median imputation, k-Nearest Neighbors (kNN), Multiple Imputation by Chained Equations (MICE), Random Forest (RF), and Decision Tree. The performance of each imputation technique was quantified using Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Variance Preservation.
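The evaluation setup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline. The data are synthetic Gaussian values, and the imputer is a simple column-mean baseline rather than any of the tuned methods. The abstract does not define Variance Preservation, so it is assumed here to be the average, over variables, of the variance after imputation divided by the original variance. The function names (`mcar_mask`, `variance_preservation`, `run_demo`) are hypothetical. The sizes match the text: 32 × 42 = 1344 cells, of which 268 (about 20%) are masked.

```python
import math
import random


def mcar_mask(n_rows, n_cols, n_missing, seed=0):
    """Choose n_missing distinct (row, col) cells uniformly at random."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    return set(rng.sample(cells, n_missing))


def rmse(truth, imputed):
    """Root mean squared error over the masked cells."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, imputed)) / len(truth))


def mae(truth, imputed):
    """Mean absolute error over the masked cells."""
    return sum(abs(t - p) for t, p in zip(truth, imputed)) / len(truth)


def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def variance_preservation(original_cols, completed_cols):
    """ASSUMED definition: mean over variables of var(completed) / var(original)."""
    ratios = [sample_variance(c) / sample_variance(o)
              for o, c in zip(original_cols, completed_cols)]
    return sum(ratios) / len(ratios)


def run_demo(n_rows=32, n_cols=42, n_missing=268, seed=0):
    rng = random.Random(seed)
    # Synthetic stand-in data stored column by column (NOT the study's measurements).
    original = [[rng.gauss(50.0, 5.0) for _ in range(n_rows)] for _ in range(n_cols)]
    mask = mcar_mask(n_rows, n_cols, n_missing, seed)

    completed, truth, filled = [], [], []
    for j, col in enumerate(original):
        observed = [v for i, v in enumerate(col) if (i, j) not in mask]
        # Baseline only: fill each gap with the mean of the observed values.
        col_mean = sum(observed) / len(observed) if observed else 0.0
        new_col = []
        for i, v in enumerate(col):
            if (i, j) in mask:
                truth.append(v)
                filled.append(col_mean)
                new_col.append(col_mean)
            else:
                new_col.append(v)
        completed.append(new_col)

    return {
        "n_missing": len(mask),
        "rmse": rmse(truth, filled),
        "mae": mae(truth, filled),
        "variance_preservation": variance_preservation(original, completed),
    }


print(run_demo())
```

Under this assumed definition, mean imputation cannot raise a variable's variance, so its score stays at or below 1. That illustrates why a variance measure is reported alongside RMSE and MAE: an imputer can look acceptable on error while still flattening the spread of the data.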

RF imputation demonstrated the best overall performance, with the lowest RMSE (1.3987) and MAE (0.4902), indicating a high level of accuracy in imputing missing values. Its variance preservation (0.8961) was also relatively close to 1, suggesting that it retains much of the original variability in the dataset. MICE was less accurate, with higher RMSE (3.0869) and MAE (1.1246), but had the variance preservation closest to 1 (1.0580).
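The trade-off between the two methods can be made explicit with a small calculation on the figures reported above. The sketch below uses only the RF and MICE values quoted in this abstract. Measuring the distance of the variance preservation score from 1 (`variance_gap`, a hypothetical helper) is an assumption about how "closest to 1" is read, not a metric named by the authors.

```python
# Values quoted in the abstract for the two methods it reports in detail.
reported = {
    "RF": {"rmse": 1.3987, "mae": 0.4902, "variance_preservation": 0.8961},
    "MICE": {"rmse": 3.0869, "mae": 1.1246, "variance_preservation": 1.0580},
}


def variance_gap(metrics):
    """Absolute distance of the variance preservation score from the ideal of 1."""
    return abs(metrics["variance_preservation"] - 1.0)


# Lower is better for RMSE and for the variance gap.
best_accuracy = min(reported, key=lambda name: reported[name]["rmse"])
best_variance = min(reported, key=lambda name: variance_gap(reported[name]))

for name, metrics in reported.items():
    print(name, "RMSE:", metrics["rmse"], "variance gap:", round(variance_gap(metrics), 4))
print("Lowest RMSE:", best_accuracy)
print("Variance preservation closest to 1:", best_variance)
```

Read this way, RF undershoots the original variance by about 0.10 and MICE overshoots it by about 0.06, while RF's RMSE is less than half that of MICE.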

The findings emphasize the importance of choosing suitable imputation techniques for small, high-dimensional, and correlated datasets such as those in craniofacial morphometry. RF emerged as the most effective method, offering a strong balance between accuracy and variance preservation.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12964763/full.md

---
Source: https://tomesphere.com/paper/PMC12964763