# Multiple imputation for missing values in ordinal variables from cancer registry data when performing Cox proportional hazards regression

**Authors:** Anika Kästner, Wolfgang Hoffmann, Johannes Hüsing, Andreas Stang, Anika Hüsing

PMC · DOI: 10.1186/s12874-026-02790-8 · BMC Medical Research Methodology · 2026-02-06

## TL;DR

This simulation study compares multiple imputation methods for missing ordinal values (ECOG-PS) in cancer registry data and finds that MICE with POLYREG performed best among them, maintaining low bias across all scenarios with N = 5,000.

## Contribution

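The study provides a simulation-based comparison of five multiple imputation approaches (MICE with POLR, POLYREG, PMM, and RF, plus the joint model) for an ordinal variable with missing values, using complete lung cancer cases from the North Rhine-Westphalia Cancer Registry.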

## Key findings

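- MICE with POLYREG maintained low bias across all scenarios with N = 5,000.
- MICE with RF and PMM performed well with up to 30%-50% missing data; MICE with POLR and the JM yielded low bias only with up to 10%-20% missing data.
- Severe bias, high MSE, wide 95% CIs, and poor coverage occurred with N = 500 and N = 1,000 when 30% or more of the values were missing and the prevalence of ECOG-PS = 4 was low.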

## Abstract

Scientists working with cancer registry data are often confronted with large proportions of missing values in ordinal variables, such as tumor stage, grading or the general health status (ECOG-PS scored 0 to 5). Despite the long-standing issue, research on handling missing ordinal cancer registry data remains sparse.

A simulation study was conducted using complete lung cancer cases (2019–2022) from the North Rhine-Westphalia Cancer Registry. Missing values in ECOG-PS were generated under varying missingness mechanisms (MCAR, MAR, MNAR), missingness proportions (10% to 50%), and sample sizes (N = 500, N = 1,000, N = 5,000). The missing values were then imputed using MICE with ordinal logistic regression (POLR), multinomial regression (POLYREG), predictive mean matching (PMM), or random forests (RF), as well as the joint model (JM). Performance was assessed by bias, mean squared error (MSE), width of the 95% CI, and coverage.
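To make the three missingness mechanisms concrete, the following is a minimal sketch of how missing values could be generated in an ordinal variable. It is not the authors' procedure: the toy data, the category prevalences, the covariate `age`, the logistic missingness model, and every function name (`toy_registry`, `missing_probabilities`, `ampute`) are assumptions made for illustration. Under MCAR every record has the same probability of being set to missing, under MAR the probability depends on an observed covariate, and under MNAR it depends on the ECOG-PS value itself.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_registry(n, seed=0):
    """Toy complete data set; the ECOG-PS prevalences are made up."""
    rng = random.Random(seed)
    weights = [0.30, 0.40, 0.18, 0.09, 0.03]  # hypothetical shares of ECOG-PS 0..4
    return [
        {"age": rng.gauss(70, 10), "ecog": rng.choices(range(5), weights)[0]}
        for _ in range(n)
    ]


def missing_probabilities(rows, mechanism, target_prop, slope=1.0):
    """Per-record probability that ECOG-PS is set to missing."""
    if mechanism == "MCAR":    # independent of all data
        drivers = [0.0] * len(rows)
    elif mechanism == "MAR":   # depends on an observed covariate (age)
        drivers = [(r["age"] - 70.0) / 10.0 for r in rows]
    elif mechanism == "MNAR":  # depends on the value that goes missing
        drivers = [float(r["ecog"]) for r in rows]
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")

    # Bisection on the intercept so the mean probability equals target_prop.
    lo, hi = -30.0, 30.0
    for _ in range(80):
        mid = (lo + hi) / 2.0
        mean_p = sum(sigmoid(mid + slope * d) for d in drivers) / len(drivers)
        if mean_p < target_prop:
            lo = mid
        else:
            hi = mid
    intercept = (lo + hi) / 2.0
    return [sigmoid(intercept + slope * d) for d in drivers]


def ampute(rows, mechanism, target_prop, seed=0):
    """Return a copy of rows with 'ecog' replaced by None where missing."""
    rng = random.Random(seed)
    probs = missing_probabilities(rows, mechanism, target_prop)
    return [
        {**r, "ecog": None} if rng.random() < p else dict(r)
        for r, p in zip(rows, probs)
    ]


if __name__ == "__main__":
    data = toy_registry(5000, seed=1)
    for mech in ("MCAR", "MAR", "MNAR"):
        amputed = ampute(data, mech, target_prop=0.30, seed=2)
        share = sum(r["ecog"] is None for r in amputed) / len(amputed)
        print(mech, round(share, 3))
```

Calibrating the intercept keeps the expected missingness proportion at the target for every mechanism, so that scenarios differ only in which records lose their value, not in how many.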

Severe bias, high MSE, wide 95% CIs, and poor coverage were found in scenarios with sample sizes of N = 500 and N = 1,000, 30% or more missing data, and a low prevalence of ECOG-PS = 4. MICE with POLYREG maintained low bias across all scenarios with N = 5,000, while MICE with RF and PMM performed well with up to 30%-50% missing data. MICE with POLR and the JM yielded low bias with up to 10%-20% missing data. Complete case analysis did not offer a systematic advantage in terms of bias or MSE compared to the MI methods evaluated.
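The four performance measures can be computed from the per-replication estimates of a simulation as sketched below. This uses common simulation-study definitions and is an assumption about the setup, not the authors' code: the estimand (for example a log hazard ratio from the Cox model), the Wald-type normal interval, and the function name `performance` are illustrative, and the exact definitions used in the paper are given in its full text.

```python
import math


def performance(estimates, std_errors, true_value, z=1.959964):
    """Summarise simulation replications against a known true value.

    estimates, std_errors: one entry per simulation replication.
    z: normal quantile for a two-sided 95% interval.
    """
    n = len(estimates)
    if n == 0 or n != len(std_errors):
        raise ValueError("need equally long, non-empty inputs")

    bias = sum(estimates) / n - true_value
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    ci_width = sum(2.0 * z * se for se in std_errors) / n
    covered = sum(
        1
        for e, se in zip(estimates, std_errors)
        if e - z * se <= true_value <= e + z * se
    )
    return {
        "bias": bias,            # mean estimate minus truth
        "mse": mse,              # mean squared deviation from truth
        "ci_width": ci_width,    # average width of the 95% CI
        "coverage": covered / n, # share of CIs containing the truth
    }


if __name__ == "__main__":
    print(performance([0.4, 0.5, 0.6], [0.05, 0.05, 0.05], true_value=0.5))
```

Bias and MSE capture accuracy of the point estimate, while CI width and coverage together show whether the reported uncertainty is both informative and honest: a method can reach nominal coverage simply by producing very wide intervals.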

Sample size and the distribution of ordinal categories affect how well missing data can be handled in registry studies. Severe bias might be introduced when sample sizes are smaller and the prevalence of categories is low, indicating finite-sample effects rather than systematic bias of the imputation methods. Among the MI methods applied, MICE with POLYREG performed best; however, further research is needed for time-to-event analyses and multivariate missingness patterns.

The online version contains supplementary material available at 10.1186/s12874-026-02790-8.

## Linked entities

- **Diseases:** lung cancer (MONDO:0005138)

## Full-text entities

- **Diseases:** cancer (MESH:D009369)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12930733/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12930733/full.md

## References

14 references — full list in the complete paper: https://tomesphere.com/paper/PMC12930733/full.md

---
Source: https://tomesphere.com/paper/PMC12930733