# Predicting the methylation status of CpG islands from read distribution biases

**Authors:** Eldar T. Abdullaev, Dinesh A. Haridoss, Peter F. Arndt

PMC · DOI: 10.1186/s12864-025-12257-7 · 2025-10-30

## TL;DR

This paper introduces a method to predict DNA methylation at CpG islands using ordinary short-read sequencing data by analyzing fragmentation biases.

## Contribution

The novel contribution is a machine learning tool, WGS2meth, that infers methylation status from read distribution biases without bisulfite or long-read sequencing.

## Key findings

- Methylated CpG sites are approximately 30% more susceptible to fragmentation than unmethylated CpG sites.
- The proposed machine learning model accurately predicts methylation status of CpG islands from ordinary sequencing reads.
- The method is implemented as a tool called WGS2meth for individual or aggregated sample analysis.
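To give a feel for why a roughly 30% difference only becomes usable once many reads are pooled, here is a back-of-the-envelope sketch. It assumes break counts at an island follow a Poisson distribution, which is an illustrative assumption and not a model taken from the paper. The function names `z_score` and `breaks_needed` are hypothetical.

```python
import math


def z_score(expected_breaks, relative_excess=0.30):
    """Excess breaks in units of Poisson standard deviations.

    ASSUMPTION: counts are Poisson with mean `expected_breaks` for an
    unmethylated island, and a methylated island has a mean that is
    (1 + relative_excess) times larger.
    """
    excess = relative_excess * expected_breaks
    return excess / math.sqrt(expected_breaks)


def breaks_needed(relative_excess=0.30, z_target=3.0):
    """Smallest expected break count at which the excess reaches z_target.

    Solves relative_excess * sqrt(n) = z_target for n.
    """
    return math.ceil((z_target / relative_excess) ** 2)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(z_score(n), 2))
    print("breaks needed for z = 3:", breaks_needed())
```

Under this toy model the signal grows with the square root of the number of observed breaks, so on the order of a hundred breaks at CpG sites are needed before a 30% excess stands about three standard deviations above the unmethylated expectation. This is consistent with the tool supporting aggregated samples.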

## Abstract

DNA methylation is an important epigenetic mark that plays a major role in transcriptional regulation, development and genome integrity. State-of-the-art methods, such as whole-genome bisulfite sequencing or long-read sequencing, allow accurate detection of DNA methylation at single-base resolution. However, except for these specialized methods, information about DNA methylation status cannot be obtained directly from ordinary short-read sequencing data. Here we propose an approach to predict the methylation status from mapped read coordinates alone. It relies on previous findings that the DNA fragmentation process during library preparation is not random, but is affected by sequence context. In particular, DNA shearing leads to preferential hydrolysis of the sugar-phosphate backbone at CpG dinucleotides. Notably, methylated CpGs are approximately 30% more susceptible to fragmentation than unmethylated CpGs, likely due to subtle differences in the conformational dynamics. These differences become prominent when multiple NGS reads at CpG islands are analyzed. Our trained machine learning model is able to detect these biases and predict whether a CpG island of interest is methylated or not. We provide our methods as a tool, WGS2meth, that predicts CpG island methylation from whole-genome sequencing reads of individual or aggregated samples.

The online version contains supplementary material available at 10.1186/s12864-025-12257-7.
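The kind of signal described in the abstract can be illustrated with a minimal sketch that measures how often read starts coincide with CpG dinucleotides in a region, relative to chance. This is not the WGS2meth feature set or model, which the abstract does not detail. The function names are hypothetical, and counting a read whose first base is the G of a CpG as a "break at a CpG" is an assumption made only for this example.

```python
def cpg_break_sites(seq):
    """Return 0-based positions p such that a read starting at p implies a
    break between the C and the G of a CpG (ASSUMPTION: this is how a
    'break at a CpG' is defined for the purposes of this sketch)."""
    seq = seq.upper()
    return {i + 1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"}


def cpg_break_enrichment(seq, read_starts):
    """Observed / expected fraction of read starts at CpG break sites.

    seq         -- sequence of the region (e.g. one CpG island)
    read_starts -- 0-based start offsets of mapped reads within seq
    The expected fraction assumes every position is an equally likely start.
    """
    sites = cpg_break_sites(seq)
    if not sites or not read_starts:
        raise ValueError("need at least one CpG and one read start")
    hits = sum(1 for p in read_starts if p in sites)
    # (hits / n_reads) / (n_sites / region_length), rearranged to one division
    return (hits * len(seq)) / (len(read_starts) * len(sites))


if __name__ == "__main__":
    island = "ACGTACGTAA"        # toy sequence with two CpGs
    starts = [2, 2, 6, 0, 9]     # toy read start offsets
    print(cpg_break_enrichment(island, starts))
```

A ratio above 1 means read starts are over-represented at CpGs in that region. Based on the abstract, methylated islands would be expected to show a somewhat higher ratio than unmethylated ones, but turning such a statistic into a prediction requires a model trained on samples with known methylation, as the authors do.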

## Full-text entities

- **Genes:** DNASE1 (deoxyribonuclease 1) [NCBI Gene 1773] {aka DNL1, DRNI}, SHROOM4 (shroom family member 4) [NCBI Gene 57477] {aka MRXSSDS, SHAP, shrm4}, DNASE1L3 (deoxyribonuclease 1L3) [NCBI Gene 1776] {aka D3, DHP2, DNAS1L3, LSD, SLEB16}, DFFB (DNA fragmentation factor subunit beta) [NCBI Gene 1677] {aka CAD, CPAN, DFF-40, DFF2, DFF40}
- **Diseases:** PMD (MESH:D020371), osteosarcoma (MESH:D012516), pancreatic adenocarcinoma (MESH:D010190), Cancer (MESH:D009369), colon adenocarcinoma cell (MESH:D003110)
- **Chemicals:** TA dinucleotide (-), S (MESH:D013455), CpG (MESH:C015772), thymidine (MESH:D013936), dinucleotides (MESH:D015226), U (MESH:D014501), bisulfite (MESH:C042345), T (MESH:D014316), sugar (MESH:D000073893), C (MESH:D002244), 5-methylcytosine (MESH:D044503), cytosine (MESH:D003596), Sugar-phosphate (MESH:D013403)
- **Species:** Homo sapiens (human, species) [taxon 9606], Bos taurus (bovine, species) [taxon 9913], Mus musculus (house mouse, species) [taxon 10090]
- **Cell lines:** T84 — Homo sapiens (Human), Colon adenocarcinoma, Cancer cell line (CVCL_0555), NUGC3 — Homo sapiens (Human), Gastric adenocarcinoma, Cancer cell line (CVCL_1612), TE11 — Homo sapiens (Human), Esophageal squamous cell carcinoma, Cancer cell line (CVCL_1761), DANG — Homo sapiens (Human), Pancreatic adenocarcinoma, Cancer cell line (CVCL_0243), SHP77 — Homo sapiens (Human), Lung small cell carcinoma, Cancer cell line (CVCL_1693), SW579 — Homo sapiens (Human), Thyroid gland squamous cell carcinoma, Cancer cell line (CVCL_3603), SKMEL30 — Homo sapiens (Human), Cutaneous melanoma, Cancer cell line (CVCL_0039), SKHEP1 — Homo sapiens (Human), Liver and intrahepatic bile duct epithelial neoplasm, Cancer cell line (CVCL_0525), SKIN — Mus musculus (Mouse), Hybridoma (CVCL_B7CU), LARGE — Homo sapiens (Human), Chronic myelogenous leukemia, BCR-ABL1 positive, Cancer cell line (CVCL_SV31), U2OS_BONE — Homo sapiens (Human), Osteosarcoma, Cancer cell line (CVCL_0042), line — Mus musculus (Mouse), Adenoma of the mouse pulmonary system, Cancer cell line (CVCL_5V03)

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12574136/full.md

---
Source: https://tomesphere.com/paper/PMC12574136