# Multi-output learning for systematic missing value imputation in DNA methylation arrays

**Authors:** Tao Ma, Jinfu Nie, Jian Huang, Yong-Biao Zhang, Joanna M Biernacka, Liguo Wang

PMC · DOI: 10.1093/bioadv/vbag052 · Bioinformatics Advances · 2026-02-15

## TL;DR

This paper introduces a new method to fill in missing data in DNA methylation arrays, improving data integration and epigenetic analysis across different array versions.

## Contribution

A two-stage imputation framework using multi-output machine learning to address systematic missing values in DNA methylation data.

## Key findings

- The framework consistently outperforms conventional imputation methods on real datasets with up to 50% induced missingness.
- It enables accurate cross-platform integration between methylation arrays and sequencing data.
- Imputing missing sites improves the accuracy of epigenetic age prediction models.

## Abstract

Illumina DNA methylation arrays have evolved rapidly, expanding genomic coverage while introducing backward incompatibilities by removing many CpG sites present in earlier versions. These changes result in systematic missing values when integrating data across array generations and substantially limit the reuse of legacy datasets.

We developed a two-stage framework for imputing missing DNA methylation values. The procedure first imputes randomly missing values using standard imputation techniques and then addresses systematic missingness using multi-output machine learning models, including support vector regression, nearest-neighbor methods, random forest models, and deep neural networks. When evaluated on real datasets with up to fifty percent induced missingness, the proposed framework consistently outperformed conventional imputation approaches. It also accurately imputes the missing CpG sites between methylation arrays and reduced representation bisulfite sequencing data, enabling robust cross-platform data integration. Analyses of large brain tumor methylation datasets demonstrate that the method restores array-specific methylation patterns while preserving biological complexity. Importantly, imputing missing methylation sites significantly improves the performance of epigenetic age prediction models.
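The two-stage idea can be illustrated with a minimal, standard-library-only sketch. This is not the ultra-impute API: the function names (`mean_impute`, `knn_multi_output`), the toy beta values, and the choice of column-mean imputation for stage one and a k-nearest-neighbor average for stage two are all assumptions made for illustration. The abstract lists nearest-neighbor methods among the stage-two models but does not specify their configuration.

```python
import math


def mean_impute(rows):
    """Stage 1 (assumed stand-in): fill sporadic gaps (None) with the column mean."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def knn_multi_output(train_x, train_y, query_x, k=2):
    """Stage 2 (assumed stand-in): predict every target site at once.

    Each query sample receives the average target vector of its k nearest
    training samples, measured by Euclidean distance on the shared sites.
    """
    n_targets = len(train_y[0])
    predictions = []
    for q in query_x:
        order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], q))
        nearest = order[:k]
        predictions.append(
            [sum(train_y[i][t] for i in nearest) / k for t in range(n_targets)]
        )
    return predictions


# Hypothetical toy beta values. Rows are samples, columns are CpG sites.
# Reference samples (older array) carry both shared and legacy-only sites.
reference_shared = [[0.10, 0.20], [0.80, None], [0.15, 0.25], [0.85, 0.90]]
reference_legacy = [[0.30, 0.70], [0.90, 0.10], [0.35, 0.65], [0.95, 0.05]]
# Samples from the newer array lack the legacy-only sites entirely.
new_shared = [[0.12, None], [0.82, 0.88]]

# Stage 1: fill random gaps in the shared sites, pooling both cohorts.
pooled = mean_impute(reference_shared + new_shared)
ref_filled = pooled[: len(reference_shared)]
new_filled = pooled[len(reference_shared):]

# Stage 2: impute the systematically missing legacy sites for the new samples.
imputed_legacy = knn_multi_output(ref_filled, reference_legacy, new_filled, k=2)
print(imputed_legacy)
```

The point of the multi-output step is that all systematically missing sites for a sample are predicted jointly from the sites both arrays share, rather than fitting one model per missing site.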

This tool is implemented in the Python package “ultra-impute,” freely available at https://github.com/liguowang/ultra-impute. A code snippet demonstrating the usage of the ultra-impute package is provided in a Jupyter Notebook (https://github.com/liguowang/ultra-impute/blob/master/doc/Tutorial.ipynb).

## Linked entities

- **Diseases:** brain tumor (MONDO:0021211)

## Full-text entities

- **Diseases:** bipolar disorder (MESH:D001714), GBM (MESH:D005909), brain tumor (MESH:D001932), Cancer (MESH:D009369)
- **Species:** Homo sapiens (human, species) [taxon 9606], Mus musculus (house mouse, species) [taxon 10090]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12955846/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12955846/full.md

## References

14 references — full list in the complete paper: https://tomesphere.com/paper/PMC12955846/full.md

---
Source: https://tomesphere.com/paper/PMC12955846