# Supervised learning of enhancer–promoter specificity based on genome-wide perturbation studies highlights areas for improvement in learning

**Authors:** Dylan Barth, Richard Van, Jonathan Cardwell, Mira V Han

PMC · DOI: 10.1093/bioinformatics/btae367 · 2024-06-13

## TL;DR

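This paper applies machine learning to one of the largest genome-wide enhancer perturbation studies, integrated with TF and histone modification ChIP-seq, to predict enhancer-promoter relationships. Integrating these data improves prediction accuracy and exposes a discrepancy between genome-wide data and data from targeted experiments.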

## Contribution

The study integrates a large enhancer perturbation dataset with TF and histone modification ChIP-seq to train supervised models that predict enhancer-promoter relationships.

## Key findings

- The density of enhancers/promoters in the genomic region and the relative strength of contact are important features for enhancer-promoter prediction.
- Several transcription factor peaks improve prediction by identifying negatives, which reduces false positives.
- Integrating multiple data types improves model accuracy and understanding of enhancer regulation.
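
To make the first finding concrete, here is a minimal sketch of how the two features could be computed for a candidate enhancer-gene pair. The definitions are assumptions for illustration, not the paper's implementation: density is taken as the count of annotated elements whose midpoint lies within a fixed window of a focal element, and relative contact as one enhancer's share of the total contact across all candidate enhancers of the same gene. The names `Element`, `element_density`, and `relative_contact`, and the 500 kb default window, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A genomic interval such as an enhancer or promoter (hypothetical type)."""
    chrom: str
    start: int
    end: int

    @property
    def midpoint(self) -> int:
        return (self.start + self.end) // 2


def element_density(focal: Element, elements: list[Element], window: int = 500_000) -> int:
    """Count elements on the same chromosome whose midpoint lies within
    `window` bp of the focal element's midpoint (assumed definition)."""
    return sum(
        1
        for e in elements
        if e.chrom == focal.chrom and abs(e.midpoint - focal.midpoint) <= window
    )


def relative_contact(contacts: dict[str, float], enhancer_id: str) -> float:
    """Share of a gene's total contact signal that comes from one enhancer.

    `contacts` maps candidate enhancer IDs to their contact frequency
    with the gene's promoter (assumed input format).
    """
    total = sum(contacts.values())
    return contacts[enhancer_id] / total if total > 0 else 0.0
```

For example, with a promoter at `chr1:1,000,000` and elements at 900 kb, 1.6 Mb, and on another chromosome, only the first falls inside a 500 kb window. Normalizing contact per gene is one way to express "relative" strength; other normalizations are equally plausible.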

## Abstract

Understanding the rules that govern enhancer-driven transcription remains a central unsolved problem in genomics. With multiple massively parallel enhancer perturbation assays now published, there are enough data to learn to predict enhancer–promoter (EP) relationships in a data-driven manner.

We applied machine learning to one of the largest enhancer perturbation studies, integrated with transcription factor (TF) and histone modification ChIP-seq. The results uncovered a discrepancy between the prediction of genome-wide data and that of data from targeted experiments. Relative strength of contact was important for prediction, confirming the basic principle of EP regulation. Novel features such as the density of enhancers/promoters in the genomic region were found to be important, highlighting our lack of understanding of how other elements in the region contribute to regulation. Several TF peaks were identified that improved the prediction by identifying the negatives and reducing false positives. In summary, integrating genomic assays with enhancer perturbation studies increased the accuracy of the model and provided novel insights into enhancer-driven transcription.
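
One way to check a claim such as "TF peaks reduce false positives" is to compare confusion counts for predictions made with and without the TF features. The sketch below shows only that bookkeeping. The labels and predictions are made-up toy values, not results from the paper, and the helper names `confusion_counts` and `precision` are hypothetical.

```python
from collections import Counter


def confusion_counts(y_true: list[int], y_pred: list[int]) -> Counter:
    """Tally TP, FP, FN, and TN for binary labels (1 = regulatory EP pair)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts: Counter = Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            counts["TP"] += 1
        elif pred:
            counts["FP"] += 1
        elif truth:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


def precision(counts: Counter) -> float:
    """Fraction of predicted positives that are true positives."""
    predicted_positive = counts["TP"] + counts["FP"]
    return counts["TP"] / predicted_positive if predicted_positive else 0.0


# Toy data for illustration only (assumed, not from the study).
y_true = [1, 0, 0, 1, 0, 0]
pred_without_tf = [1, 1, 1, 1, 0, 0]  # hypothetical model without TF features
pred_with_tf = [1, 0, 1, 1, 0, 0]     # hypothetical model with TF features

baseline = confusion_counts(y_true, pred_without_tf)
with_tf = confusion_counts(y_true, pred_with_tf)
```

In this toy case the second set of predictions turns one false positive into a true negative, so precision rises while recall is unchanged. A real comparison would use held-out perturbation data and the trained models from the repository.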

The trained models, data, and the source code are available at http://doi.org/10.5281/zenodo.11290386 and https://github.com/HanLabUNLV/sleps.

## Full-text entities

- **Genes:** CREBBP (CREB binding lysine acetyltransferase) [NCBI Gene 1387] {aka CBP, KAT3A, MKHK1, RSTS, RSTS1}, NR2C1 (nuclear receptor subfamily 2 group C member 1) [NCBI Gene 7181] {aka TR2}, FOXM1 (forkhead box M1) [NCBI Gene 2305] {aka FKHL16, FOXM1A, FOXM1B, FOXM1C, HFH-11, HFH11}, GATA1 (GATA binding protein 1) [NCBI Gene 2623] {aka CNSHA9, ERYF1, GATA-1, GF-1, GF1, HAEADA}, RBFOX2 (RNA binding fox-1 homolog 2) [NCBI Gene 23543] {aka FOX2, Fox-2, HNRBP2, HRNBP2, RBM9, RTA}, GATA2 (GATA binding protein 2) [NCBI Gene 2624] {aka DCML, IMD21, MONOMAC, NFE1B}, CHAMP1 (chromosome alignment maintaining phosphoprotein 1) [NCBI Gene 283489] {aka C13orf8, CAMP, CHAMP, MRD40, NEDHILD, ZNF828}, CEBPB (CCAAT enhancer binding protein beta) [NCBI Gene 1051] {aka C/EBP-beta, IL6DBP, NF-IL6, TCF5}, EP300 (EP300 lysine acetyltransferase) [NCBI Gene 2033] {aka KAT3B, MKHK2, RSTS2, p300}, POGZ (pogo transposable element derived with ZNF domain) [NCBI Gene 23126] {aka MRD37, WHSUS, ZNF280E, ZNF635, ZNF635m}, HCFC1 (host cell factor C1) [NCBI Gene 3054] {aka CFF, HCF, HCF-1, HCF1, HFC1, MAHCX}, BRD4 (bromodomain containing 4) [NCBI Gene 23476] {aka CAP, CDLS6, FSHRG4, HUNK1, HUNKI, MCAP}, NFIC (nuclear factor I C) [NCBI Gene 4782] {aka CTF, CTF5, NF-I, NF-I/C, NF1-C, NFI}, STAT5A (signal transducer and activator of transcription 5A) [NCBI Gene 6776] {aka MGF, STAT5}, JUNB (JunB proto-oncogene, AP-1 transcription factor subunit) [NCBI Gene 3726] {aka AP-1}, EREG (epiregulin) [NCBI Gene 2069] {aka EPR, ER, Ep}, MDFIC (MyoD family inhibitor domain containing) [NCBI Gene 29969] {aka HIC, LMPHM12, MDFIC1}, STAT2 (signal transducer and activator of transcription 2) [NCBI Gene 6773] {aka IMD44, ISGF-3, P113, PTORCH3, STAT113}, KLF16 (KLF transcription factor 16) [NCBI Gene 83855] {aka BTEB4, DRRF, NSLP2}, PML (PML nuclear body scaffold) [NCBI Gene 5371] {aka MYL, PP8675, RNF71, TRIM19}, E2F8 (E2F transcription factor 8) [NCBI Gene 79733] {aka E2F-8}, NR2F2 (nuclear receptor subfamily 2 group F member 2) [NCBI Gene 7026] {aka ARP-1, ARP1, CHTD4, COUPTF2, COUPTFB, COUPTFII}, STAT1 (signal transducer and activator of transcription 1) [NCBI Gene 6772] {aka CANDF7, IMD31A, IMD31B, IMD31C, ISGF-3, STAT91}, RAD51 (RAD51 recombinase) [NCBI Gene 5888] {aka BRCC5, FANCR, HRAD51, HsRad51, HsT16930, MRMV2}, XRCC3 (X-ray repair cross complementing 3) [NCBI Gene 7517] {aka CMM6}, ZBTB11 (zinc finger and BTB domain containing 11) [NCBI Gene 27107] {aka MRT69, ZNF-U69274, ZNF913}, ATG2A (autophagy related 2A) [NCBI Gene 23130] {aka BLTP4A}, EMSY (EMSY transcriptional repressor, BRCA2 interacting) [NCBI Gene 56946] {aka C11orf30, GL002}, Itpr3 (inositol 1,4,5-triphosphate receptor 3) [NCBI Gene 16440] {aka IP3R 3, IP3R-3, Ip3r3, Itpr-3, tf}, MAD2L2 (mitotic arrest deficient 2 like 2) [NCBI Gene 10459] {aka FANCV, MAD2B, POLZ2, REV7}, CBX5 (chromobox 5) [NCBI Gene 23468] {aka HEL25, HP1, HP1A, HP1alpha}
- **Diseases:** cancer (MESH:D009369), neurodevelopmental disorder (MESH:D002658)
- **Chemicals:** CRISPRi (-)
- **Species:** Mus musculus (house mouse, species) [taxon 10090], Homo sapiens (human, species) [taxon 9606]
- **Mutations:** ATG2A
- **Cell lines:** K562 — Homo sapiens (Human), Blast phase chronic myelogenous leukemia, BCR-ABL1 positive, Cancer cell line (CVCL_0004), S2 — Drosophila melanogaster (Fruit fly), Spontaneously immortalized cell line (CVCL_Z232)

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11211214/full.md

---
Source: https://tomesphere.com/paper/PMC11211214