# A novel high-dimensional model for identifying regional DNA methylation QTLs

**Authors:** Kaiqiong Zhao, Archer Y Yang, Karim Oualkacha, Yixiao Zeng, Kathleen Klein, Marie Hudson, Inés Colmegna, Sasha Bernatsky, Celia M T Greenwood

PMC · DOI: 10.1093/biostatistics/kxaf032 · 2025-10-26

## TL;DR

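This paper introduces a high-dimensional varying coefficient model, fitted with a composite sparse penalty, for identifying genetic variants that locally influence DNA methylation levels across a regulatory region (regional mQTLs).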

## Contribution

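A composite sparse penalty that encourages both sparsity and smoothness in the varying coefficients, paired with an efficient proximal gradient descent algorithm that scales to high-dimensional predictor spaces and returns sparse solutions.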

## Key findings

- In simulations assessing estimation, prediction and selection accuracy, adding smoothness control yields much better results than sparsity-only approaches.
- An adaptive version of the penalty offers additional performance gains.
- The method was applied to identify regional mQTLs in asymptomatic samples from the CARTaGENE cohort.

## Abstract

Varying coefficient models offer the flexibility to learn the dynamic changes of regression coefficients. Despite their good interpretability and diverse applications, in high-dimensional settings, existing estimation methods for such models have important limitations. For example, we routinely encounter the need for variable selection when faced with a large collection of covariates with nonlinear/varying effects on outcomes, and no ideal solutions exist. One illustration of this situation could be identifying a subset of genetic variants with local influence on methylation levels in a regulatory region. To address this problem, we propose a composite sparse penalty that encourages both sparsity and smoothness for the varying coefficients. We present an efficient proximal gradient descent algorithm that scales to high-dimensional predictor spaces, providing sparse solutions for the varying coefficients. A comprehensive simulation study has been conducted to evaluate the performance of our approach in terms of estimation, prediction and selection accuracy. We show that the inclusion of smoothness control yields much better results over sparsity-only approaches. An adaptive version of the penalty offers additional performance gains. We further demonstrate the utility of our method in identifying regional mQTLs from asymptomatic samples in the CARTaGENE cohort. The methodology is implemented in the R package sparseSOMNiBUS, available on GitHub.
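The general idea of combining sparsity and smoothness inside a proximal gradient loop can be illustrated with a minimal sketch. This is not the authors' penalty or the sparseSOMNiBUS API: the squared-error loss, the second-difference roughness penalty, the group-lasso proximal step, and all function and parameter names (`fit_sparse_smooth`, `lam_sparse`, `lam_smooth`) are assumptions chosen for illustration. Each covariate is represented by a group of basis coefficients; the roughness penalty is handled in the gradient step, and the group penalty is handled by a proximal step that can set a whole group exactly to zero.

```python
import numpy as np

def second_difference_penalty(k):
    """Return D^T D, where D is the (k-2) x k second-difference matrix."""
    D = np.diff(np.eye(k), n=2, axis=0)
    return D.T @ D

def group_soft_threshold(v, t):
    """Proximal operator of t * ||v||_2: shrink the group, or zero it out."""
    norm = np.linalg.norm(v)
    if norm <= t:
        return np.zeros_like(v)
    return (1.0 - t / norm) * v

def fit_sparse_smooth(X, y, groups, lam_sparse, lam_smooth, n_iter=500):
    """Proximal gradient descent for an illustrative objective (an assumption):
    (1/2n)||y - X b||^2 + lam_smooth * sum_g b_g' Omega_g b_g
                        + lam_sparse * sum_g ||b_g||_2
    """
    n, p = X.shape
    # Block-diagonal roughness penalty, one block per coefficient group.
    Omega = np.zeros((p, p))
    for g in groups:
        Omega[np.ix_(g, g)] = second_difference_penalty(len(g))
    # Step size 1/L, with L the largest eigenvalue of the smooth part's Hessian.
    H = X.T @ X / n + 2.0 * lam_smooth * Omega
    step = 1.0 / np.linalg.eigvalsh(H)[-1]
    beta = np.zeros(p)
    for _ in range(n_iter):
        # Gradient of the smooth part (loss plus roughness penalty).
        grad = X.T @ (X @ beta - y) / n + 2.0 * lam_smooth * (Omega @ beta)
        z = beta - step * grad
        # Proximal step: group-wise soft-thresholding gives group sparsity.
        for g in groups:
            beta[g] = group_soft_threshold(z[g], step * lam_sparse)
    return beta

# Toy data (hypothetical): 3 covariates, each with 5 basis coefficients;
# only the first covariate has a nonzero, smoothly varying effect.
rng = np.random.default_rng(0)
n, k = 200, 5
groups = [np.arange(j * k, (j + 1) * k) for j in range(3)]
X = rng.standard_normal((n, 3 * k))
beta_true = np.zeros(3 * k)
beta_true[groups[0]] = [1.0, 1.2, 1.4, 1.6, 1.8]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

beta_hat = fit_sparse_smooth(X, y, groups, lam_sparse=1.0, lam_smooth=0.1)
selected = [j for j, g in enumerate(groups) if np.any(beta_hat[g] != 0)]
print("selected groups:", selected)
```

Keeping the roughness term in the gradient step and the group norm in the proximal step is one common way to split such an objective; the paper's composite penalty may be constructed differently.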

## Full-text entities

- **Genes:** BANK1 (B cell scaffold protein with ankyrin repeats 1) [NCBI Gene 55024] {aka BANK}, COMTD1 (catechol-O-methyltransferase domain containing 1) [NCBI Gene 118881] {aka MT773}, MIR4520-2 (microRNA 4520-2) [NCBI Gene 100616466] {aka MIR4520B, hsa-mir-4520-2, mir-4520-2}, PRTN3 (proteinase 3) [NCBI Gene 5657] {aka ACPA, AGP7, C-ANCA, CANCA, MBN, MBT}, LINC01252 (long intergenic non-protein coding RNA 1252) [NCBI Gene 338817]
- **Diseases:** rheumatoid arthritis (MESH:D001172)
- **Mutations:** rs6488428, rs34530485, rs5752590, rs113739199, rs73023414, rs74910977, rs4880435, rs2914359, rs12121709, rs138292190, rs72992741, rs56161922, rs4746260, rs1205169, rs991762, rs9905053, rs10067831, rs12243610, rs10773053, rs17766894

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12554007/full.md

---
Source: https://tomesphere.com/paper/PMC12554007