# EPINTLM: enhancer–promoter prediction with pretrained k-mer embeddings and residual cross-attention

**Authors:** Thi Lan Nguyen, Hien Quang Kha, Phat Ky Nguyen, Minh Huu Nhat Le, Duc-Trong Le, Nguyen Quoc Khanh Le

PMC · DOI: 10.1093/bib/bbag064 · 2026-02-16

## TL;DR

EPINTLM is a deep learning model that predicts enhancer-promoter interactions from DNA sequences and genomic features, achieving competitive AUROC and AUPR on a benchmark spanning six human cell lines.

## Contribution

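EPINTLM introduces a deep learning framework for enhancer-promoter interaction prediction that combines pretrained k-mer embeddings from the Nucleotide Transformer with residual self-attention and bidirectional cross-attention, alongside a unified preprocessing pipeline.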
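To make the idea of cross-attention with a residual connection concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the authors' implementation: it omits learned query/key/value projections, multiple heads, and normalization layers, and the function names (`cross_attention`, `bidirectional_cross_attention`) and toy dimensions are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention(queries, context):
    """Scaled dot-product attention: each query row attends over the context rows.

    Assumption: no learned projections, so queries and context share one width.
    """
    d = len(queries[0])
    scores = matmul(queries, transpose(context))
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = softmax_rows(scores)
    return matmul(weights, context), weights


def add_residual(x, update):
    """Residual connection: add the attention output back onto its input."""
    return [[a + b for a, b in zip(rx, ru)] for rx, ru in zip(x, update)]


def bidirectional_cross_attention(enhancer, promoter):
    """Enhancer tokens attend to promoter tokens and vice versa."""
    enh_ctx, _ = cross_attention(enhancer, promoter)
    pro_ctx, _ = cross_attention(promoter, enhancer)
    return add_residual(enhancer, enh_ctx), add_residual(promoter, pro_ctx)


# Toy inputs: 5 enhancer tokens and 3 promoter tokens, each an 8-dim embedding.
random.seed(0)
enhancer = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]
promoter = [[random.gauss(0, 1) for _ in range(8)] for _ in range(3)]
enh_out, pro_out = bidirectional_cross_attention(enhancer, promoter)
```

Each output keeps the shape of its own input sequence, which is what allows the residual addition; a real model would typically follow this with pooling and a classification head.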

## Key findings

- EPINTLM achieves competitive AUROC and AUPR performance on a benchmark across six human cell lines.
- Ablation studies show cross-attention and residual aggregation are key to model performance.
- A unified preprocessing pipeline improves training efficiency and reproducibility.
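For readers unfamiliar with the two metrics named above, the following sketch computes them on toy data. It assumes binary labels with at least one positive and one negative, estimates AUPR as average precision, and ignores tied scores in that estimate; the paper's exact evaluation code may differ, and the labels and scores here are made up for illustration.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical predictions for six enhancer-promoter pairs (1 = interacting).
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
roc = auroc(labels, scores)
aupr = average_precision(labels, scores)
```

AUPR is usually the more informative of the two when interacting pairs are rare relative to non-interacting ones, since it is not inflated by the large number of easy negatives.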

## Abstract

Enhancer–promoter interactions (EPIs) play an important role in gene regulation, yet experimental mapping remains costly and limited in coverage. As a result, computational approaches are commonly evaluated under curated benchmark datasets, which pose challenges related to long-range sequence modeling, multimodal feature integration, and reproducible preprocessing. In this study, we present EPINTLM (Enhancer–Promoter Interaction Nucleotide Transformer Large Model), a deep learning framework designed to investigate architectural strategies for EPI prediction under standardized benchmark settings. EPINTLM integrates DNA sequence representations and genomic features by leveraging pretrained k-mer embeddings from the Nucleotide Transformer and explicitly modeling intra- and inter-sequence dependencies through residual self-attention and bidirectional cross-attention. We additionally introduce a unified preprocessing pipeline to improve training efficiency and reproducibility, and perform post hoc motif analysis to provide limited interpretability of learned sequence patterns. Evaluated on a widely used benchmark across six human cell lines, EPINTLM achieves competitive area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPR) performance relative to existing methods, with ablation studies highlighting the contributions of cross-attention and residual aggregation. These results demonstrate the utility of explicit cross-attention designs for paired regulatory sequence modeling within current benchmark constraints.
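The abstract refers to k-mer embeddings without spelling out the tokenization. As a rough illustration only, the sketch below splits a DNA sequence into non-overlapping 6-mers and emits any leftover bases as single-nucleotide tokens. The choice of `k=6`, the non-overlapping scheme, and the function name `kmer_tokens` are assumptions made for this example, not details confirmed by this summary.

```python
def kmer_tokens(sequence, k=6):
    """Split a DNA string into non-overlapping k-mers.

    Assumption: bases left over at the end become single-nucleotide tokens.
    """
    seq = sequence.upper()
    cut = len(seq) - len(seq) % k  # index where the last full k-mer ends
    tokens = [seq[i:i + k] for i in range(0, cut, k)]
    tokens.extend(seq[cut:])  # a string extends the list one character at a time
    return tokens


tokens = kmer_tokens("acgtacgtacgtac")
```

In a pretrained-embedding setup, each token would then be mapped to a vocabulary index and looked up in the language model's embedding table before being passed to the attention layers.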

## Linked entities

- **Species:** Homo sapiens (taxon 9606)

## Full-text entities

- **Genes:** CTCF (CCCTC-binding factor) [NCBI Gene 10664] {aka CFAP108, FAP108, MRD21}
- **Diseases:** cervical carcinoma (MESH:D002583), leukemia (MESH:D007938)
- **Chemicals:** EPI (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]
- **Cell lines:** HeLa-S3: Homo sapiens (Human), Human papillomavirus-related endocervical adenocarcinoma, Cancer cell line (CVCL_0058); K562: Homo sapiens (Human), Blast phase chronic myelogenous leukemia, BCR-ABL1 positive, Cancer cell line (CVCL_0004); HUVEC: Homo sapiens (Human), Finite cell line (CVCL_2959); NHEK: Homo sapiens (Human), Finite cell line (CVCL_9Q50); HELA: Homo sapiens (Human), Human papillomavirus-related endocervical adenocarcinoma, Cancer cell line (CVCL_0030); IMR90: Homo sapiens (Human), Finite cell line (CVCL_0347); vein: Homo sapiens (Human), Finite cell line (CVCL_3722); GM12878: Homo sapiens (Human), Transformed cell line (CVCL_7526)

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12908682/full.md

---
Source: https://tomesphere.com/paper/PMC12908682