# Evaluation of imputation strategies for multi-centre studies: Application to a large clinical pathology dataset

**Authors:** Lucy Grigoroff, Reika Masuda, John Lindon, Janonna Kadyrov, Jeremy K. Nicholson, Elaine Holmes, Julien Wist

PMC · DOI: 10.1371/journal.pone.0335852 · PLOS One · 2025-11-20

## TL;DR

The paper compares two Random Forest-based imputation methods, missForest and MICE, on a large multi-centre clinical pathology dataset with artificially introduced missingness, and finds that missForest outperforms MICE.

## Contribution

The study evaluates two Random Forest-based (RF) imputation methods, missForest and MICE, together with variable-inclusion and stratification strategies, for handling missing data in multi-centre clinical datasets.

## Key findings

- MissForest outperformed MICE in imputing missing data.
- Stratification severely degraded MICE performance but had minimal effect on missForest.
- MissForest is robust and capable of automatic variable selection.

## Abstract

As part of a strategy for accommodating missing data in large heterogeneous datasets, two Random Forest-based (RF) imputation methods, missForest and MICE, were evaluated along with several strategies to help navigate the inherently incomplete structure of the dataset. Background: A total of 3817 complete cases of clinical chemistry variables from a large-scale, multi-site preclinical longitudinal pathology study were used as an evaluation dataset. Three types of ‘missingness’ in various proportions were artificially introduced to compare imputation performance for different strategies including variable inclusion and stratification. Results: MissForest was found to outperform MICE, being robust and capable of automatic variable selection. Stratification had minimal effect on missForest but severely deteriorated the performance of MICE. Conclusion: In general, storing and sharing datasets prior to any correction is a good practice, so that imputation can be performed on merged data if necessary.
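
The evaluation design described in the abstract (remove known values from complete cases, impute them, then score the imputed values against the truth) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' code: the function names `mask_mcar`, `impute_column_mean`, and `nrmse` are hypothetical, the data are synthetic, and a column-mean imputer stands in as a placeholder where missForest or MICE would be called. The sketch covers only one missingness mechanism (values removed completely at random), which may or may not match the three types used in the paper, and the normalised RMSE shown is one common scoring choice, not necessarily the paper's metric.

```python
import math
import random


def mask_mcar(rows, proportion, seed=0):
    """Return (masked copy, masked cells): `proportion` of cells set to None at random."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    n_mask = round(proportion * len(cells))
    chosen = set(rng.sample(cells, n_mask))
    masked = [
        [None if (i, j) in chosen else v for j, v in enumerate(row)]
        for i, row in enumerate(rows)
    ]
    return masked, sorted(chosen)


def impute_column_mean(rows):
    """Placeholder imputer (assumption): fill each gap with its column's observed mean."""
    means = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def nrmse(truth, imputed, masked_cells):
    """RMSE over the masked cells, normalised by the variance of the true values there."""
    if not masked_cells:
        raise ValueError("no masked cells to score")
    true_vals = [truth[i][j] for i, j in masked_cells]
    mse = sum((imputed[i][j] - truth[i][j]) ** 2 for i, j in masked_cells) / len(true_vals)
    mean_true = sum(true_vals) / len(true_vals)
    var_true = sum((v - mean_true) ** 2 for v in true_vals) / len(true_vals)
    if var_true == 0:
        raise ValueError("true values at masked cells have zero variance")
    return math.sqrt(mse / var_true)


# Synthetic "complete cases": three loosely related columns (illustrative only).
data_rng = random.Random(42)
complete = []
for _ in range(200):
    x = data_rng.gauss(0.0, 1.0)
    complete.append([x, 2.0 * x + data_rng.gauss(0.0, 0.5), data_rng.gauss(0.0, 1.0)])

# Repeat the mask, impute, score cycle for several proportions of missingness.
for proportion in (0.05, 0.10, 0.30):
    masked, cells = mask_mcar(complete, proportion, seed=1)
    imputed = impute_column_mean(masked)
    print(f"missing={proportion:.2f}  NRMSE={nrmse(complete, imputed, cells):.3f}")
```

Because the true values are known for every masked cell, any two imputers can be compared on identical masks by swapping out `impute_column_mean` while keeping the masking seed and the scoring function fixed.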

## Full-text entities

- **Genes:** Gpt (glutamic pyruvic transaminase, soluble) [NCBI Gene 76282] {aka 1300007J06Rik, 2310022B03Rik, ALT, ALT1, Gpt-1, Gpt1}, Slc17a5 (solute carrier family 17 (anion/sugar transporter), member 5) [NCBI Gene 235504] {aka 4631416G20Rik, 4732491M05, AST, ISSD, NSD, SD}
- **Diseases:** Toxicity (MESH:D064420)
- **Chemicals:** Urea Nitrogen (MESH:C530477), Creatinine (MESH:D003404), YAll (-)
- **Species:** Mus musculus (house mouse, species) [taxon 10090], Rattus norvegicus (brown rat, species) [taxon 10116]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12633936/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12633936/full.md

## References

43 references — full list in the complete paper: https://tomesphere.com/paper/PMC12633936/full.md

---
Source: https://tomesphere.com/paper/PMC12633936