# Improving Genotype Imputation in High‐Dimensional Pharmacogenomics Using Multiple Imputation: Evaluation with Machine Learning Approaches

**Authors:** Innocent G. Asiimwe, Tao You, Daniel F. Carr, Munir Pirmohamed, Geraint Davies, Andrea L. Jorgensen

PMC · DOI: 10.1002/cpt.70171 · Clinical Pharmacology and Therapeutics · 2025-12-17

## TL;DR

This paper shows that multiple imputation of missing genotypes gives more reliable inference in high-dimensional pharmacogenomic datasets than single imputation or traditional approaches such as mean imputation and complete-case analysis, and that it detects associations single imputation misses.

## Contribution

The novel contribution is a multiple imputation framework for high-dimensional pharmacogenomics that incorporates genotype probabilities, imputation uncertainty (INFO score), and missingness percentages.
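The idea of building multiple imputed datasets from genotype probabilities can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `(p0, p1, p2)` input layout, and the default of 10 imputations per call are assumptions made for illustration, and the sketch ignores the INFO score and missingness percentages that the paper's framework also uses.

```python
import random


def sample_genotypes(probs, n_imputations=10, seed=0):
    """Draw several hard-call genotype vectors from per-individual probabilities.

    probs: list of (p0, p1, p2) tuples giving the probability that an
    individual carries 0, 1, or 2 copies of the alternate allele
    (hypothetical input layout).
    Returns a list of n_imputations genotype lists, one per imputed dataset.
    """
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    datasets = []
    for _ in range(n_imputations):
        # One independent draw per individual, weighted by its probabilities.
        draw = [rng.choices((0, 1, 2), weights=p, k=1)[0] for p in probs]
        datasets.append(draw)
    return datasets


def expected_dosage(p):
    """Single-imputation counterpart: the expected alternate-allele count."""
    return p[1] + 2 * p[2]


if __name__ == "__main__":
    probs = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.2, 0.5, 0.3)]
    for dataset in sample_genotypes(probs, n_imputations=3):
        print(dataset)
```

Individuals with a near-certain genotype get the same call in every imputed dataset, while uncertain ones vary between datasets. That variation is what lets downstream analyses reflect imputation uncertainty, which a single expected dosage cannot do.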

## Key findings

- In simulations, only multiple imputation achieved adequate coverage (above 90%); on the imputation server, coverage rose from 0% with single imputation to up to 94% under 10% missingness.
- Multiple imputation recovered known pharmacogenomic associations and detected new genome-wide signals missed by single imputation.
- Penalized regression performed best for SNP selection in the high-effect scenario (F1 = 0.897 ± 0.091), while GWAS followed by random forest performed best in the low-effect scenario (F1 = 0.657 ± 0.110).
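For readers unfamiliar with the F1 metric used to compare SNP selection methods, the following sketch shows how it could be computed when the truly associated SNPs are known, as in a simulation. The function name and the rsIDs are hypothetical, and the paper's exact evaluation procedure may differ.

```python
def snp_selection_f1(selected, causal):
    """F1 score of a selected SNP set against the known causal SNP set."""
    selected, causal = set(selected), set(causal)
    true_positives = len(selected & causal)
    if true_positives == 0:
        return 0.0  # also covers empty inputs, avoiding division by zero
    precision = true_positives / len(selected)  # share of selected SNPs that are causal
    recall = true_positives / len(causal)       # share of causal SNPs that were selected
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical identifiers: 2 of 3 selected SNPs are causal, 2 of 4 causal SNPs found.
    score = snp_selection_f1({"rs1", "rs2", "rs3"}, {"rs1", "rs2", "rs4", "rs5"})
    print(round(score, 3))
```

F1 is the harmonic mean of precision and recall, so a method scores well only if it both finds most causal SNPs and selects few spurious ones.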

## Abstract

Multiple imputation is well‐established for handling missing data, yet its use in high‐dimensional genetic datasets remains limited. Using pharmacokinetic tuberculosis simulations and SNP data (1000 Genomes Project), we compared machine learning (ML) and traditional approaches (e.g., mean imputation and complete‐case analysis) for imputation and covariate selection. We developed a multiple imputation framework incorporating genotype probabilities, imputation uncertainty (INFO score), and missingness percentages. Dimensionality reduction enabled scalable random forest and penalized regression for covariate selection. In simulations, only multiple imputation achieved adequate coverage (percentage of 95% confidence intervals containing the true value) exceeding a 90% nominal threshold. For example, on the imputation server, coverage improved from 0% with single imputation to up to 94% under 10% missingness. Applied to clinical warfarin datasets (War‐PATH, n = 548; IWPC, n = 316) and the UK Biobank (n = 500, 1000), multiple imputation recovered known pharmacogenomic associations (CYP2C9\*8/\*9/\*11; VKORC1 ‐1639G>A), reduced false‐positives, and detected signals missed by single imputation (e.g., genome‐wide significant rs4697699, SLC2A9 locus). Computational costs were modest, adding only ~1.25 minutes for 10 imputations to the 22.7 minutes required by single imputation on the Michigan Imputation Server. For SNP selection, penalized regression performed best in the high‐effect scenario (F1 = 0.897 ± 0.091), while GWAS followed by random forest performed best in the low‐effect scenario (F1 = 0.657 ± 0.110). These findings show that multiple imputation improves reliability and discovery in high‐dimensional pharmacogenomics, with ML offering promising but inconsistent benefits during SNP selection. However, generalizability beyond the studied datasets and computational scalability to larger biobank‐scale analyses remain important limitations that warrant further investigation.
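The coverage figures in the abstract depend on how estimates from several imputed datasets are combined into one interval. A minimal sketch, assuming the standard Rubin's rules for pooling and a normal-approximation interval (the abstract does not state which pooling rules or reference distribution the authors used), could look like this. All function names are illustrative.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their variances with Rubin's rules.

    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # variance across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


def normal_ci(estimate, total_variance, z=1.96):
    """Approximate 95% interval; a t reference distribution is the usual refinement."""
    half_width = z * math.sqrt(total_variance)
    return estimate - half_width, estimate + half_width


def coverage(intervals, true_value):
    """Percentage of intervals that contain the true value."""
    hits = sum(1 for low, high in intervals if low <= true_value <= high)
    return 100 * hits / len(intervals)


if __name__ == "__main__":
    pooled, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
    print(pooled, total, normal_ci(pooled, total))
```

Because the total variance adds a between-imputation term, pooled intervals are wider than those from a single imputed dataset whenever the imputations disagree. This is the usual reason single imputation under-covers: it treats imputed genotypes as if they were observed.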

## Linked entities

- **Genes:** CYP2C9 (cytochrome P450 family 2 subfamily C member 9) [NCBI Gene 1559], VKORC1 (vitamin K epoxide reductase complex subunit 1) [NCBI Gene 79001]
- **Diseases:** tuberculosis (MONDO:0018076)

## Full-text entities

- **Genes:** VKORC1 (vitamin K epoxide reductase complex subunit 1) [NCBI Gene 79001] {aka EDTP308, MST134, MST576, VKCFD2, VKOR}, CYP2C9 (cytochrome P450 family 2 subfamily C member 9) [NCBI Gene 1559] {aka CPC9, CYP2C, CYP2C10, CYPIIC9, P450-2C9, P450IIC9}, SLC2A9 (solute carrier family 2 member 9) [NCBI Gene 56606] {aka GLUT9, GLUTX, UAQTL2, URATv1}
- **Chemicals:** warfarin (MESH:D014859)
- **Mutations:** -1639G>A, rs4697699

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12997507/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12997507/full.md

## References

28 references — full list in the complete paper: https://tomesphere.com/paper/PMC12997507/full.md

---
Source: https://tomesphere.com/paper/PMC12997507