# EpiSmokEr2: a robust epigenetic classifier for smoking status inference using Illumina EPIC methylation data

**Authors:** Tianyu Zhu, Teodóra Faragó, Sailalitha Bollepalli, Aino Heikkinen, Mikaela Hukkanen, Olli Raitakari, Terho Lehtimäki, Tellervo Korhonen, Jaakko Kaprio, Fang Fang, Kaitlyn G. Lawrence, Dale P. Sandler, Mari Roberts Spildrejorde, Kristina Gervin, Yanyu Pan, Ricardo Costeira, Jordana T Bell, Miina Ollikainen

PMC · DOI: 10.1080/17501911.2026.2630841 · Epigenomics · 2026-02-17

## TL;DR

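EpiSmokEr2 is a DNA methylation-based tool that infers smoking status from blood samples profiled on the Illumina EPIC array (average sensitivity 0.87 and specificity 0.86 for current versus never smokers), and it remains robust when up to 10% of its CpGs are missing.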

## Contribution

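EpiSmokEr2 is a new LASSO regression-based DNAm classifier of smoking status built on 511 CpGs from the Illumina EPIC array, trained on 1343 Young Finns Study samples and validated in six independent datasets spanning the EPIC and EPICv2 platforms.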

## Key findings

- EpiSmokEr2 achieved an average sensitivity of 87% and specificity of 86% in distinguishing current from never smokers.
- Predicted smoking status correlated strongly with established smoking-related DNAm scores and with GrimAge.
- In simulation analysis, EpiSmokEr2 remained robust with up to 10% of its CpGs missing.
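To make the first finding concrete, here is a minimal sketch of how sensitivity and specificity can be computed for a current-versus-never comparison. The function name, the label strings, and the toy data are illustrative assumptions, not taken from the paper or the EpiSmokEr2 package. It also assumes that any prediction other than "current" counts as negative and that samples with another true label (such as former smokers) are skipped; the paper may use a different convention.

```python
def sensitivity_specificity(true_labels, predicted_labels,
                            positive="current", negative="never"):
    """Return (sensitivity, specificity) for positive-vs-negative samples.

    Assumption: samples whose true label is neither class are ignored, and
    any prediction that is not `positive` is treated as a negative call.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        elif truth == negative:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels, invented for illustration only.
truth = ["current"] * 4 + ["never"] * 5 + ["former"]
preds = ["current", "current", "current", "never",
         "never", "never", "never", "never", "current",
         "current"]
sens, spec = sensitivity_specificity(truth, preds)
```

In this toy example three of four current smokers and four of five never smokers are classified correctly, so `sens` is 0.75 and `spec` is 0.8.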

## Abstract

Tobacco smoking induces persistent DNA methylation (DNAm) changes in blood that can serve as long-term biomarkers for smoking exposure. We aimed to develop and validate a DNAm classifier of smoking status using Illumina EPIC array data.

We built Epigenetic Smoking status Estimator2 (EpiSmokEr2), a Least Absolute Shrinkage and Selection Operator (LASSO) regression-based DNAm classifier using 511 CpGs from Illumina Infinium MethylationEPIC array (EPIC) data. The model was trained on 1343 samples from the Young Finns Study cohort and validated across six independent datasets from four cohorts and two array platforms (EPIC and EPICv2).
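The scoring step of a sparse linear classifier of this kind can be sketched as follows. This is an illustration only: it assumes a multinomial (softmax) form over three classes, and the CpG names, intercepts, and weights are hypothetical placeholders rather than the 511 CpGs or coefficients of EpiSmokEr2. Model fitting, where the LASSO penalty shrinks most coefficients to exactly zero, is not shown.

```python
import math

# Hypothetical per-class intercepts and sparse weights (illustrative values).
# After a LASSO fit only CpGs with non-zero coefficients need to be stored.
INTERCEPTS = {"current": 3.0, "former": 0.0, "never": 1.0}
WEIGHTS = {
    "current": {"cpg_a": -6.0, "cpg_b": 2.0},
    "former": {"cpg_a": -1.0, "cpg_c": 1.5},
    "never": {"cpg_a": 4.0, "cpg_b": -1.0},
}


def class_probabilities(betas):
    """Softmax over per-class linear scores of methylation beta values."""
    scores = {
        label: INTERCEPTS[label]
        + sum(w * betas[cpg] for cpg, w in WEIGHTS[label].items())
        for label in INTERCEPTS
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def predict_status(betas):
    """Return the class with the highest predicted probability."""
    probs = class_probabilities(betas)
    return max(probs, key=probs.get)


# Two made-up samples: low versus high methylation at the hypothetical cpg_a.
low_cpg_a = {"cpg_a": 0.2, "cpg_b": 0.8, "cpg_c": 0.5}
high_cpg_a = {"cpg_a": 0.9, "cpg_b": 0.3, "cpg_c": 0.5}
```

With these invented weights, `predict_status(low_cpg_a)` returns `"current"` and `predict_status(high_cpg_a)` returns `"never"`, which shows how a handful of weighted CpGs can separate classes.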

EpiSmokEr2 achieved an average sensitivity of 0.87 and specificity of 0.86 in distinguishing current from never smokers. Predicted smoking status correlated strongly with established DNAm smoking scores and GrimAge, indicating its ability to capture biologically relevant smoking effects. Simulation analysis showed EpiSmokEr2 was robust for up to 10% missing CpGs.
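A missing-CpG simulation of the general kind described above could look like the sketch below. It is not the authors' procedure: the summary does not say how missing CpGs are handled, so filling them with a per-CpG reference mean is an assumption, and every name, weight, and beta value here is randomly generated for illustration.

```python
import random


def mask_cpgs(betas, fraction, rng):
    """Copy `betas`, setting a random `fraction` of CpGs to None (missing)."""
    n_missing = round(fraction * len(betas))
    missing = set(rng.sample(sorted(betas), n_missing))
    return {cpg: (None if cpg in missing else value)
            for cpg, value in betas.items()}


def impute(betas, reference_means):
    """Fill missing values with a per-CpG reference mean (assumed strategy)."""
    return {cpg: (reference_means[cpg] if value is None else value)
            for cpg, value in betas.items()}


def linear_score(betas, weights):
    """Weighted sum of beta values over the CpGs used by the model."""
    return sum(weights[cpg] * betas[cpg] for cpg in weights)


rng = random.Random(0)  # fixed seed so the run is repeatable
cpgs = [f"cpg_{i}" for i in range(100)]            # hypothetical CpG names
weights = {cpg: rng.uniform(-1, 1) for cpg in cpgs}
reference_means = {cpg: 0.5 for cpg in cpgs}       # assumed reference values
sample = {cpg: rng.uniform(0, 1) for cpg in cpgs}

full_score = linear_score(sample, weights)
for fraction in (0.0, 0.05, 0.10, 0.20):
    masked = impute(mask_cpgs(sample, fraction, rng), reference_means)
    drift = abs(linear_score(masked, weights) - full_score)
    print(f"{fraction:.0%} missing: score drift {drift:.3f}")
```

Repeating the masking many times per fraction and tracking how often the predicted class changes would give a robustness curve across missingness levels.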

EpiSmokEr2 provides a reliable DNAm-based estimator of smoking status. It is available as an open-source R package on GitHub, facilitating broad use in epidemiological and clinical research.

**Plain language summary:** Smoking leaves chemical marks on our DNA, like footprints that reveal a person’s smoking history. We developed EpiSmokEr2, a tool that reads these marks to tell whether someone currently smokes, used to smoke, or has never smoked. The tool was built using data from Finnish participants and tested in other populations, where it also worked well. EpiSmokEr2 is free to use, and researchers can use it to uncover smoking exposure when people cannot or do not report it accurately. In the future, this tool could also help doctors check a patient’s smoking status.

## Full-text entities

- **Genes:** AHRR (aryl hydrocarbon receptor repressor) [NCBI Gene 57491] {aka AHH, AHHR, bHLHe77}
- **Diseases:** cardiovascular disease (MESH:D002318), Smoking (MESH:D015208), respiratory disorders (MESH:D012131), cancers (MESH:D009369)
- **Chemicals:** cotinine (MESH:D003367), nicotine (MESH:D009538)
- **Species:** Homo sapiens (human, species) [taxon 9606], Nicotiana tabacum (American tobacco, species) [taxon 4097]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12962688/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12962688/full.md

## References

33 references — full list in the complete paper: https://tomesphere.com/paper/PMC12962688/full.md

---
Source: https://tomesphere.com/paper/PMC12962688