Modeling nascent transcription from chromatin landscape and structure with CLASTER
Marc Pielies Avellí, Arnór Ingi Sigurdsson, Joaquim Ollé López, Takeo Narita, Nils Krietenstein, Chunaram Choudhary, Simon Rasmussen

TL;DR
CLASTER is a deep learning model that predicts nascent transcription levels using chromatin landscape and 3D structure data.
Contribution
CLASTER introduces a novel deep neural network integrating chromatin data to predict transcription.
Findings
CLASTER effectively translates chromatin data into kilobasepair-resolution transcription levels.
The model reveals genomic organization as a key factor in machine learning approaches for transcription.
CLASTER enables prediction of the impact of epigenetic perturbations in silico.
Abstract
We present the Chromatin Landscape and Structure to Expression Regressor (CLASTER), an epigenetic-based deep neural network that can integrate different data modalities describing the chromatin landscape and its 3D structure. CLASTER effectively translates them into nascent transcription levels measured at a kilobasepair resolution. The model provides a platform to understand the epigenetic drivers and learned rules of nascent transcription, and to predict the impact of in silico epigenetic perturbations. We conclude that the predominant locality of current machine learning approaches emerges as a signature of genomic organization, having broad implications for future modeling approaches. The online version contains supplementary material available at 10.1186/s13059-026-03992-5.
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3- —http://dx.doi.org/10.13039/501100009708Novo Nordisk Fonden
- —https://doi.org/10.13039/501100001732Danmarks Grundforskningsfond
- —https://doi.org/10.13039/501100000781European Research Council
- —Copenhagen University
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenomics and Chromatin Dynamics · Epigenetics and DNA Methylation · Chromatin Remodeling and Cancer
Background
A single genome gives rise to a wide variety of cells, with diverse morphological properties and functions. These features of the cells emerge when certain regions of the DNA are expressed while others are kept silent, a process known as gene regulation. While traditional expression readouts like RNA-seq or Cap Analysis of Gene Expression (CAGE) focus on processed products of transcription and initiation sites, complementary techniques have been developed to track nascent transcription and elongation more closely, such as RNA labeling with 5-ethynyl uridine and sequencing (EU-seq) [1]. Newly synthesized transcripts indirectly reflect RNA polymerase distribution and dynamics across the genome [2]. Therefore, EU-seq profiles typically display a signal spike near the transcription start site (TSS) where polymerases bind and accumulate during promoter-proximal pausing. This peak diminishes across the gene body as fewer polymerases reach distal regions, resulting in lower transcript mapping. Additionally, EU-seq enrichment patterns contain subtle, higher resolution variations that are influenced by a number of factors. For instance, some degree of co-transcriptional splicing and polymerase velocity changes between exonic and intronic regions can result in EU-seq differences. This level of detail is however lost when studying broad genomic regions at a coarse resolution.
The rules governing how genes are regulated in different cellular contexts still constitute a highly debated topic, and their understanding is crucial to study development and disease. Many models have already been proposed approaching the problem at different biological layers [3–7]. At a high level, gene expression is modulated by the interplay of a set of regulatory elements, e.g. enhancers and promoters, which have been shown to be broadly compatible in most cases [8, 9]. Their activities are usually described using a number of epigenetic marks and can be combined multiplicatively to shape RNA transcription [9–11]. The establishment of functional pairings between these regulatory elements is then conditioned by their contact frequency, given by the three-dimensional structure of chromatin. Models such as the Activity by Contact (ABC) [12] and the more recent ENCODE-rE2G [13] exploit these features.
Regulatory elements and their associated genes are ultimately encoded in the DNA. Many computational approaches have therefore put a special emphasis on the genomic sequence, identifying patterns and motifs, integrating them and predicting a set of output tracks describing various downstream genomic and epigenomic processes [3–7]. The advent of large language models (LLMs) is also powering a new wave of sequence-based machine learning models such as Nucleotide Transformer [14], Borzoi [15], DNABERT [16], HyenaDNA [17] or Evo [18, 19]. These present improvements in several fronts, e.g., being more robust to genomic variability, extending the studied genomic regions or achieving higher resolutions while bringing down the computational costs. Finally, new advances in the 3D organization field and chromatin conformation capture techniques have brought to light finer structural details (Micro-C) [20, 21] and also show promise to improve predictive modeling [22, 23].
However, this explosion in the genomics analysis toolbox is mostly inherited from the computer vision and natural language processing fields, which pursue different objectives and describe data modalities that attend to highly different sets of rules. This is best illustrated by the fact that even some of the most advanced machine learning models rely primarily on the close neighborhood of the predicted points, without paying attention to most of the sequence provided [11]. This is of particular interest in the case of attention-based deep learning models, which handle the complete sequences at once and therefore do not impose any spatial constraints. This raises the question: are these machine learning models predominantly local because of current technical limitations and training schemes, or is this a property of the underlying data distribution, i.e., a signature of genomic organization induced by elements such as evolutionary pressure?
Here we present Chromatin Landscape and Structure to Expression Regressor (CLASTER), a deep convolutional neural network aimed to learn the mapping from a given chromatin landscape and its three-dimensional structure to the corresponding nascent RNA profiles. Similar principles have already been successfully applied in gene expression classification [24, 25], regression [22], and contact map prediction [26, 27]. The value of CLASTER is two-fold. First, it enables us to predict EU-seq profiles of nascent transcription from easily available chromatin marks in a DNA sequence agnostic manner. Second, the trained models constitute a platform to unveil new biology, since they can identify patterns between chromatin marks and across the genome at several levels of abstraction and combine them in a non-linear fashion. Of note, the multiplicative nature of enhancer and promoter activities deeply motivates the use of non-linearities, which would not be captured by linear models [3]. Finally, CLASTER allows us to explore the power of new perturbational approaches on such data to simulate the transcriptional impact of epigenetic remodeling, which has already shown great therapeutic potential [28].
Results
CLASTER integrated multi-modal chromatin data to predict nascent RNA levels in a sequence-agnostic manner
We developed CLASTER, a deep neural network designed using the EIR framework [29] and trained it to translate a given chromatin landscape and its corresponding high-resolution 3D contact frequency map to the matching target nascent RNA profiles (Fig. 1a, Additional file 1: Fig. S1, Additional file 2: Table S1). Nascent transcription levels were obtained for mouse Embryonic Stem Cells (mESCs) from Narita et al. 2021 [30], where they had been measured by RNA labeling with 5-ethynyl uridine and sequencing (EU-seq). CLASTER was trained to predict kilobasepair resolution EU-seq profiles in a wide region, spanning from −200 kbp to 200 kbp from the TSS of a gene of interest that was located at the center of the predicted window (Fig. 1b, Additional file 1: Fig. S2-4, Methods). Chromosomes 17 and 4 were used to obtain our validation and test set regions, respectively (Additional file 2: Table S2).Fig. 1CLASTER can translate the chromatin landscape and structure to the corresponding nascent RNA profiles in mESCs. a Model schema. CLASTER maps the epigenetic landscape of a chromatin region of 1Mbp and its matching contact frequency map to the corresponding nascent expression levels in a window of 401 kbp. b Targets and predictions around the gene Tmem222 (chr4) for the different models. Prediction ranges were adapted to the context length of each model. Gene annotations (mm10) and candidate cis-regulatory elements (cCREs) are shown on top. c Density plot showing log-transformed, binwise predictions vs. target values in all test samples, i.e. across protein coding genes in chromosome 4. d Length normalized nascent transcription levels for the central gene in each predicted sample. Normalized transcription levels were obtained by integrating the area under the EU-seq curve inside the gene boundaries and normalizing by gene length. e Pearson and Spearman correlations between test predictions and targets at binwise and gene levels for the different models
To perform the predictions, the network used two matrices representing the different data modalities, which were processed in separate branches of the model. The first branch processed four complementary genomic tracks that characterized the epigenetic landscape in a chromatin region of 1 Mbp at a 100 bp resolution. These four core tracks describe chromatin accessibility (ATAC-seq), promoter activity (H3K4me3), promoter and enhancer activity (H3K27ac) and heterochromatin (H3K27me3), respectively (Additional file 1: Fig. S5). The heterochromatin mark H3K9me3 was not used in order to keep the model compact and ease its generalizability to new datasets. An optional second branch of the model was aimed at processing structural data, in the form of high resolution contact frequency maps (Micro-C) or functionally enriched promoter-Capture Hi-C maps (prom-CHiC) exploring 1 Mbp at a time (Additional file 1: Fig. S6). The maps described contact frequencies between different parts of the genome for the same cell line and were obtained from [31] and [32] at 1.6 and 5 kbp resolutions respectively. The model then extracted features in both data modalities separately using a set of dilated convolutions and residual connections (Fig. 1a, Additional file 1: Fig. S1, Additional file 2: Table S1). Optional multi-headed attention layers were then added to the landscape branch to ease the processing of the sequential appearance of such features and long-range interactions between them. The resulting high-level features were then combined and integrated using a set of fully connected layers, and finally a dense layer connected the last hidden layer to each output separately.
CLASTER displayed a high test prediction accuracy, with Spearman and Pearson correlations between log transformed predicted values and ground truth signal intensities of r_s_ = 0.7717 and r_p_ = 0.8690 across all predicted bins (Fig. 1c, Additional file 2: Table S3). Test performances increased to r_s_ = 0.9275 and r_p_ = 0.8859 when comparing the expected nascent transcription levels inside target gene boundaries (Fig. 1d, Methods), where a lower number of near zero values would be expected. In summary, CLASTER predicted nascent RNA levels given the corresponding chromatin landscape and its matching high-resolution contact frequency map. This was achieved by extracting features at different levels of abstraction from both data modalities without relying on the DNA sequence. Of note, these predictions comprised most protein coding genes in chromosome 4, with no filters on specific ranges of expression levels nor their enhancer-dependence status.
Nascent transcription can be predicted from DNA sequence
The enrichment levels of different chromatin marks in a given locus, which define the chromatin states [33], may be interpreted as epigenetic embeddings containing relevant functional information of that genomic region. This makes them somewhat similar in role to the inner sequence representations of DNA-based models like the Enformer [4] or HyenaDNA [17], and therefore we wanted to compare their respective capabilities to predict the nascent RNA landscape described by EU-seq.
Sequence embeddings of the central 160 kbp regions matching our samples were obtained for the Enformer model with pretrained, frozen weights (Additional file 1: Fig. S7, Methods). We then trained a simple head mapping the embeddings to the target EU-seq outputs (Additional file 1: Fig. S8-9, Methods). The performance of the model head on Enformer’s pretrained embeddings, i.e., with no fine-tuning nor further training required, achieved Pearson and Spearman correlations of r_s_ = 0.8067 and r_p_ = 0.8739 for the test predictions (Fig. 1e, Additional file 1: Fig. S9-10). However, the target window had to be shortened given Enformer’s narrower context length and cropping of the latent sequence representations (Fig. 1b, Additional file 1: Fig. S11-12). Sequence embeddings obtained by pretraining the DNA language models HyenaDNA-160-kbp and 32-kbp in a next nucleotide prediction task could not be used in the same manner to directly predict nascent expression levels in a regression fashion (Methods). The embeddings could however be tailored to our task by fine-tuning the pretrained backbone and the added head together. Predictions using a fine-tuned HyenaDNA-32k model displayed Pearson and Spearman correlation coefficients of r_s_ = 0.7514 and r_p_ = 0.7403 (Fig. 1e, Additional file 1: Fig. S10).
In conclusion, DNA-sequence based models trained in a supervised manner like the Enformer provide embeddings encoding valuable functional information, which can be directly probed to predict nascent transcription levels. These results were to be expected since the Enformer was originally trained to predict 1,643 mouse genomic tracks, including RNA polymerase II binding and a number of CAGE experiments that were highly related to our outputs. HyenaDNA, a genomic sequence foundation model, could also be used to predict the nascent transcription landscape. However, sequence embeddings obtained from pretraining HyenaDNA in a next nucleotide prediction task did not encode the information necessary to predict our targets, requiring further fine-tuning of both HyenaDNA backbone and the added model head.
CLASTER learned the functional roles of chromatin states rather than those of specific marks
In silico perturbations of the inputs can allow us to retrieve some of the information learned by the models, while enabling the simulation of arbitrary epigenetic variability (Fig. 2). They might be used, for instance, to identify potential transcriptional changes induced by an epigenetic silencing of chromatin, a valuable alternative to genetic knock-out [28]. To this end, we performed virtual perturbational experiments aimed to simulate a loss of the activity of single active enhancers using only the chromatin landscape branch of CLASTER (Methods). Proximal and distal Enhancer Like Signatures (pELS, dELS), were obtained from [34] (Additional file 2: Table S4).Fig. 2. In silico perturbations unveil the learned regulatory logic. a Combinations of chromatin mark enrichment values at a given bin sampled from chromosome 4. Example chromatin states and the resulting changes induced by perturbations P1 and P2 are illustrated. b Absolute changes in the outputs as a function of the distance to the perturbed enhancer. Mean (red), standard deviation (shaded red) and maximum change (grey) are shown for each output node across chr19 enhancer-centered samples. c Length distribution of the silenced enhancers (chr19). d Absolute gene area changes as a function of the distance to the enhancer. e Histogram showing the order of the most affected gene, i.e. the primary target of each silenced enhancer. f Input mark profiles when inserting the 40 kbp chromatin region covering the KLF4 super enhancer starting 60kbp upstream of Akirin2. Note that we kept the lowered activity promoter and that inputs are shown at their resolution of 100 bp. g Input tracks when flipping the promoter-proximal chromatin landscape (± 10 kbp). h Predicted EU-seq profiles after the perturbations
We found that the enrichment levels of different chromatin marks in native chromatin were entangled, i.e., they did not explore all possible values but were only found in a set of allowed combinations encoding chromatin states with diverse functional roles [33] (Fig. 2a). Perturbing the enhancer mark H3K27ac alone (P1) had a modest impact on the predicted profiles (Additional file 1: Fig. S13). An isolated in silico ablation of H3K27ac levels in active enhancer loci yielded still active chromatin states with unaltered accessibility and background H3K4me3 levels, unlike what would be expected in vivo [30]. The integrative nature of the convolutions across tracks in the inputs led the model to map chromatin mark combinations to the corresponding RNA landscape instead of assigning independent roles to different input tracks. This limited the ability of single mark perturbations to infer biologically meaningful phenotypes, but illustrated that the ratios between marks and their combined contributions rather than isolated mark enrichments define the regulatory role of a genomic region [35, 36].
In silico silencing of enhancer loci induces mainly the downregulation of close genes
To better simulate the effects of an epigenetically silenced enhancer, the whole chromatin state at each target enhancer locus was substituted for a heterochromatic-like state (P2), enhancer by enhancer. An arbitrary silenced state was obtained by setting all marks except the silencing H3K27me3 to zero reads (Methods). The introduced chromatin state substitutions mainly modified the predicted expression profiles around the perturbed point in a distance-dependent fashion, having an influence on a varying number of genes (Fig. 2b, c). Interestingly, the perturbations led to a sustained decrease in EU-seq across the body of the affected domains while keeping the overall structure characterizing the profiles (Additional file 1: Fig. S13, bottom). Clusters of neighboring ELS affected the same region, pointing towards a coordinated action of several enhancers [13]. Furthermore, the degree of downregulation that could be exerted by in silico silencing an active enhancer on any given gene presented a clear inverse relation with the distance separating them, decaying rapidly after a few kilobase pairs (Fig. 2d, Additional file 2: Table S5). From a complementary point of view, ranking the primary gene targets of each silenced enhancer as a function of their order of proximity further confirmed that the most affected gene was in most instances the closest one (932/1396 ~ 67% of the active enhancers in chromosome 19) (Fig. 2e, Additional file 1: Fig. S13, Methods).
To summarize, predicting the EU-seq levels in a genomic region of 401 kbp per sample allowed us to assess the range and degree of influence of a set of tailored in silico perturbations in enhancer loci. The effects of such perturbations were not propagated towards regions far from the perturbed locus, indicating a predominantly local behavior of the network. CLASTER linked the perturbations in those loci to the adjacent domains with significant EU-seq enrichment, modifying in a distance-dependent manner the transcription profiles and therefore affecting a varying number of genes found around each perturbed locus.
Benchmarking predicted effects with experimental enhancer knock-outs
To assess the ability of in silico simulations to recapitulate true biology, we aimed to compare CLASTER’s predictions with ground truth associations obtained from experimental enhancer knock outs using CRISPR interference (CRISPRi). The latest, most comprehensive study on enhancer-gene interactions defined by CRISPR perturbations was performed by Gschwind et al. [13], a collaborative effort harmonizing a number of previous seminal works mostly performed on the human erythroleukemia cell line (K562) [37–39]. Given the lack of publicly available data matching our experiments for K562 (Methods), we retrained CLASTER to predict widely available tracks describing transcription. These included mature RNA-seq genome-wide profiles for the corresponding erythroleukemia cell line and also RNA-polymerase II binding profiles (POLR2A ChIP-seq), as a proxy of nascent transcription (Additional file 1: Fig. S14-16). When predicting RNA-seq, the model could estimate gene boundaries and expression levels but failed to distinguish exons from introns due to the lack of information in the inputs (Additional file 1: Fig. S14). This led to a poorer performance on the baseline prediction task than when predicting EU-seq (Additional file 2: Table S2, Additional file 1: Fig. S15). Similarly, the model could locate POLR2A enrichment peaks, but struggled to predict their magnitude, perhaps due to the wide dynamic range of the mark or a higher degree of stochasticity (Additional file 1: Fig. S14-16).
We then centered the prediction window at the enhancers targeted in the CRISPR assays, computed the output changes after in silico silencing those that were active (Methods) and used these changes as a quantitative readout of the impact of the perturbed enhancer on each gene inside the predicted window (Additional file 2: Table S6, Methods). POLR2A binding changes were measured around the TSS of the genes to capture promoter-proximal binding and pausing peaks, while RNA expression differences were integrated inside gene boundaries (Additional file 1: Fig. S17, Methods).
Comparing the primary gene target of our perturbations against CRISPR ground truths we obtained the following precision-recall values: (Pr = 0.272, Re = 0.427) using the model predicting RNA-seq, (Pr = 0.302, Re = 0.486) using the model predicting POLR2A, and (Pr = 0.507, Re = 0.622) assigning the closest gene to each enhancer (Additional file 2: Table S7, Additional file 1: Fig. S18). The accuracies were 0.72, 0.73 and 0.84 respectively, notably high given the class imbalance towards non-interacting pairs. Absolute integrated changes for each gene and each output modality were then used as continuous scores of interaction likelihood. These were also normalized to the change in the most affected gene inside the same predicted window (Methods), infusing some information about the local environment of each potential target to the scores. The normalized scores for RNA-seq, POLR2A, and distance to the enhancer yielded average precisions (AUPRC) of 0.241, 0.258 and 0.485 respectively, with a score of 0.164 for the random baseline (Additional file 1: Fig. S18). The performances at their best precision-recall thresholds were (Pr = 0.223, Re = 0.580), (Pr = 0.242, Re = 0.618) and (Pr = 0.402, Re = 0.667) (Additional file 2: Table S8). These results at high recall regimes are comparable with those obtained by other epigenetic-based methods as reported in Gschwind et al. [13], in particular when not using enhancer-promoter distance to explicitly inform the scores.
In summary, we extended CLASTER’s predictions to alternative and more widespread readouts of transcription, aiming to compare the predicted effects of our in silico perturbations with experimental ground truths. Despite the expected loss in prediction accuracy, the results align with those from other chromatin based approaches. Of note, these do not clearly outperform the baseline model assigning each enhancer to the closest gene, which is still a key source of information when used to classify interactions. The fact that the distance between an enhancer and a promoter is such a good predictor of interaction justifies the local and distance-dependent behavior of CLASTER and further supports our claims.
In silico perturbations unveil the learned regulatory logic
The perturbational framework was then extended to new scenarios in order to shed light into the regulatory rules learned by the model (Fig. 2f-h). The idea that the promoter mark constitutes a pointer to the TSS and hence the model predicts an EU-seq enrichment adjacent to it was tested by in silico adding a new promoter to a non-coding region. We copied the chromatin state at the promoter region of Akirin2 (chr4), and inserted it into the intergenic region upstream of said gene. This resulted in a new EU-seq peak with a sharp increase and a slower decay, as expected (Fig. 2h, dark green). The notion that promoter activity drives the magnitude of the EU-seq signal was tested in silico by modulating the height of the H3K4me3/H3K27ac peak. Here, we lowered (× 0.5) and increased (× 2) the enrichment levels of the input marks at the promoter proximal region. This resulted in locally decreased and enhanced EU-seq levels, respectively (Fig. 2h, black and light green). The role of proximal enhancers in driving transcription could also be evaluated. Here we copied the chromatin state of the KLF4 super enhancer, a super enhancer in mESCs also in chromosome 4, and inserted it immediately upstream to Akirin2 (Fig. 2f, shaded blue). Since Akirin2 seems to have a strong promoter that would mask the enhancer effect, we kept the lower activity (× 0.5) promoter. This yielded an increased transcription when compared with the sample where only the promoter was lowered (Fig. 2h, red). Finally, we wanted to know if the model used the extension of ATAC-seq or H3K4me3 profiles [40] towards the gene body to define the direction of transcription. To keep the realism of the chromatin state, we flipped the promoter proximal region of Akirin2, which has a clear accessibility and promoter mark extension (Fig. 2g). This resulted in the desired effect, i.e. the transcriptional profile reflected that of a gene on the negative strand (Fig. 2h, yellow).
These examples illustrate the potential of in silico perturbations to shed light into the regulatory rules learned by neural networks, in particular CLASTER, when aiming to understand the process of transcription from an epigenetic perspective.
Most of the provided context was not used for the predictions
To disentangle the contributions of all marks, we used the integrated gradients attribution method [41]. This axiomatic attribution mechanism enabled us to quantify the relative importance of the different chromatin marks at all measured loci towards the predicted EU-seq levels at each target locus (Fig. 3a). Positive and negative attribution scores indicate direct or inverse associations between input and output enrichment levels, respectively. The resulting maps portrayed the expected relative contributions of all marks at a given distance to an arbitrary predicted point (Methods).Fig. 3. Averaged attributions provide information about chromatin mark significance and scale of influence in the prediction task. a Attribution method schema. Micro-C arrays are rotated 45 degrees and corners are cropped to yield rectangular maps (Additional file 1: Fig. S22). Each of the positions in both chromatin arrays and structural arrays are given an attribution score towards the prediction of each single output node. Attribution arrays are centered, stacked and averaged to obtain an attribution map as in d). b Averaged attributions of each genomic track in the chromatin landscape branch. Attributions are centered at the prediction site and rescaled between zero and the maximum attribution score. c Chromatin landscape attributions when the model is also provided contact maps as input. d Absolute attribution map for the chromatin structure branch. When overlaid to a contact map region like the blue or black rectangles in a), it conveys the importance that the model attributes on average to each part of the map when predicting nascent transcription levels at the genomic locus corresponding to the top center of the map (the pointer). Color scale ranges from no attribution, i.e. Score = 0, to maximum attribution, with score 1. Map smoothed with gaussian kernel of σ = 1. e Average of signed attribution score maps for the structure branch. Scale ranges from −1 to 1, where 0 corresponds to no contribution. Positive values (red) are assigned to regions that contribute to raise output EU-seq levels. Negative values (blue) indicate an inverse relationship between input contact signals and output EU-seq levels. Map smoothed with a gaussian kernel of σ = 2 to ease the identification of regions. f Model performance comparison. The addition of contact information or attention to handle long range interactions did not improve the accuracy of the predictions
The promoter mark H3K4me3 was found to be the most important predictor of nascent expression levels when using only the chromatin landscape branch of the model, as it was assigned the highest attribution score (i.e. feature importance) around the predicted point (Fig. 3b). The averaged attribution scores of the silencing mark H3K27me3 were comparable to those of H3K4me3, while accessibility and the enhancer and promoter mark H3K27ac were assigned lower scores, peaking at around 60% and 40% of that of H3K4me3, respectively. Strikingly, the integrated gradients analysis presented a sharp, power-law decay of the attributions as a function of the distance to the predicted locus. Attribution scores dropped to less than 20% of their peak value after 5 to 10 kbp from the predicted locus, depending on the mark (Fig. 3b). Here, the curves presented an elbow, after which the decay followed at a slower rate. Further than 50 kbp away, the attribution scores were almost negligible. Similar results were obtained when removing the enhancer mark H3K27ac from the inputs, which led the model to shift the focus from H3K4me3 to ATAC-seq and H3K27me3 while keeping the range of influence of the marks and overall predictive power (Additional file 1: Fig. S19).
Thus, CLASTER relied mainly on a window of 20 kbp around the target locus to perform the predictions, while leaving most of the sequence unattended. This rather local behavior, observed also in the perturbational scenario, contrasts with the fact that 1 Mbp of context was available to the network and provided a complementary support to the results presented in [11]. These findings were further validated by shrinking CLASTER’s input context to 20 kbp, yielding similar performances to the baseline (Additional file 2: Tables S1 and S3, Additional file 1: Fig. S20). A significantly wider field of view than 20 kbp was covered by the dilated convolutions rather early in the network (Additional file 2: Table S1), and therefore it is improbable that this locality appeared as a limitation of the receptive field of the convolutions.
Contact maps provided an added layer of information and interpretability to CLASTER
The fact that the model was not attending to potential distal interactions motivated us to add chromatin contact maps explicitly as inputs to the model. Convolutional neural networks could in principle be sensitive to various structural patterns in such contact matrices, e.g. point-like interactions between separated elements, chromatin compartments and topologically associated domains (TADs), which would not be explicitly accounted for in more classical approaches [12].
The use of contact maps in their usual, symmetrical format, induced in fact the creation of several convolutional filters capturing the aforementioned features (Additional file 1: Fig. S21, left). We found that the symmetry of the original matrices limited the power of image processing-based approaches when studying Micro-C data, as the nuanced features extracted in early stages of the network were lost in deeper layers, where primarily a broad sense of regional contact was propagated forward (Additional file 1: Fig. S21, right). In addition, complementing the available information with contact maps induced a decrease in the attributions of the model to the promoter mark (Fig. 3c), suggesting a certain degree of overlap in the information coming from the two data modalities. To introduce contact information in a more appropriate manner, Micro-C matrices were rotated and cropped, and explored by scrolling the convolutional filters on the sequence axis only (Fig. 3a, Additional file 1: Fig. S22, Methods). This procedure reduced the redundancy of the information and skipped the contact-poor corner areas of the contact maps.
Following the same steps as before, we proceeded to evaluate the contributions of every bin in a Micro-C matrix towards the prediction of EU-seq levels at the central position (Fig. 3a). The attribution map obtained from absolute scores brought to light two different behaviors of the network depending on the distance in 1D sequence between points in physical contact (Fig. 3d). Attributions of bins connecting loci closer than ca. 30 kbp showed a clear peak around the predicted point, i.e., the main contribution to the nascent transcription levels in the predicted locus came from the inputs describing the local environment and interactions of the predicted locus with its close neighborhood at a TAD scale. On the contrary, the network did not use the local neighborhood of loci located more than 30 kbp upstream or downstream of the prediction point. These results were in line with our previous findings based on chromatin state attributions. Interestingly, the network did now attend to regions of the map linking more distal regions. The network’s attributions for all bins connecting loci found at the same distance in 1D sequence were quite similar from 30 kbp on, yielding a certain positional invariance (Fig. 3d). The main variation in attributions therefore occurred across different distances and was characterized by a sharp increase at around 30 kbp.
To assess whether a high level of contact in a certain region of the map contributed to raising or lowering the expression levels at the predicted locus, we then computed averages of the maps with signed attributions (Fig. 3e, Additional file 1: Fig. S23-24). The neighborhood of the predicted point had once more the highest impact on the output, with a higher number of counts or contact in that region contributing the most to raising expression levels. Of note, positive attribution scores were assigned to loci in physical contact with the predicted locus, indicating also a positive association between input and output signals. This feature was quite conserved across contact distances from 30 kbp onwards and suggested a potential boost of expression in the target profiles when a contact in such a region was to be found (Fig. 3e, blue lines). Finally, these results were replicated using lower resolution promoter-capture Hi-C maps, enriched in interactions with promoters. The model once more focused on the local environment of the predicted locus, giving positive attribution scores to its vicinity and the elements in contact with it while assigning negative attributions to the local environment of distal regions (Additional file 1: Fig. S25-26).
In summary, CLASTER could integrate contact information to that already provided in the chromatin landscape, understanding the impact of features in different regions of the contact maps towards the corresponding expression levels. The network now attended to distal interacting elements, but the main contributor to the predictions was still the local environment around the target locus, which displayed a clearly distinct behavior.
Explicit modeling of distal interactions did not significantly impact the performance
The introduction of multi-headed attention has been shown to improve the prediction performance of DNA-based models [4, 15, 18]. Unlike convolutions, which process information at different scales hierarchically, the attention mechanism allows the models to process all parts of the sequence simultaneously. Chromatin-based models could also benefit from such a mechanism if the sequential appearance of patterns in the input or the interactions between distal regions were key factors to understand the data. We therefore introduced a Transformer Encoder block with two multiheaded attention layers, aiming to process the sequential appearance of the most abstract features in the chromatin landscape only setting (Fig. 1a, Additional file 1: Fig. S1, Additional file 2: Table S1). However, a baseline model employing only convolutions on the chromatin landscape branch alone reached r_s_ = 0.7814 and r_p_ = 0.8852 in the test set, performing as well as the one incorporating attention layers, which achieved r_s_ = 0.7815 and r_p_ = 0.8808. (Fig. 3f, Additional file 2: Table S3). This result was robust to dataset split changes, as testing on chromosome 19 samples yielded r_s_ = 0.7677 and r_p_ = 0.8880 (Additional file 1: Fig. S27). This was still the case when coupling high resolution structural information from Micro-C maps to CLASTER, with r_s_ = 0.7717 and r_p_ = 0.8690 and when providing functionally enriched contact data, with the prom-CHiC CLASTER model reaching r_s_ = 0.7718 and r_p_ = 0.8736 (Additional file 1: Fig. S25). Despite the model capturing rich features from structural data and understanding their relation to the output as shown previously, the added contact information did not boost the overall ability of CLASTER to fit the output EU-seq profiles.
Overall, our findings show that predictions could mostly be performed without relying on interactions between highly distal regulatory elements. Most of the information required for the predictions could already be extracted by integrating the local chromatin landscape using dilated convolutions. Importantly, the prediction task was framed as a locus-by-locus regression of binned transcription levels, and therefore the main factor limiting CLASTER’s predictive power was the lack of spatial information in the inputs. This information was not to be found in contact maps with lower resolutions either, which could for instance contribute to raise the overall expression levels for enhancer-dependent genes, but would not necessarily encode higher resolution variation coming from exonic vs intronic differences. Lastly, the final dense layers of CLASTER could already handle long-range interactions of abstract features, and hence the attention mechanism might need to be introduced earlier in the architecture to process lower-level features.
Discussion
Our results support a rather confined gene regulation scenario. A predominantly local regulatory paradigm would seem plausible from the premises that most enhancers can activate most promoters [8, 9] and that both frequency and intensity of the interactions between different genomic elements rapidly decay as a function of their separation in 1D sequence [20] and are further constrained by the 3D compartmentalization of chromatin. An increased body of work from different labs would in fact point in this direction from complementary perspectives. First, recent CRISPR-based studies have found variants in noncoding regions to alter mostly their nearmost genes [13, 37, 42, 43]. Although predictive models have achieved high precisions in their predictions, the naïve baseline model associating a given genomic variant to the closest gene still provides a high sensitivity, i.e. a remarkably high number of true cases are captured under this assumption [37, 44]. Global enhancer ablation assays in a native chromatin environment illustrated biases in the distribution of enhancers in the genome, showing a preference for cell type specific genes to be frequently located close to strong enhancers [8]. Finally, recent high resolution chromatin conformation capture assays proposed micro compartmentalization rather than loop extrusion mechanisms to be driving enhancer-promoter interactions [21], linking diseased phenotypes to the disruption of compartment boundaries [45].
Despite these findings, the difficulty of state-of-the-art machine learning approaches to capture the impact of distal enhancer variants in gene expression is still the focus of improvement for the next generation of models, which are rapidly benefiting from the latest advances in NLP and have a strong focus on the exploration of larger architectures [14] and the processing of longer context lengths [15, 17, 22, 23]. These efforts contrast with the fact that early transformer-based models like the Enformer [4] already achieved remarkable results while relying on reduced genomic windows for the predictions [11]. Such behavior, which both the Enformer and CLASTER display, emerges as a consequence of the overall data distribution characterizing the studied system [8]. In other words, the spatial arrangement and organization of different regulatory elements in the genome does not reward the models enough to capture the expression changes induced by highly distal functional interactions, which become weaker [46] and increasingly rare with distance [11, 47]. Despite long range interactions being reported to play crucial roles in specific regulatory processes [48–50], the results presented here support the idea that the information required to perform gene expression prediction is in most cases to be found in the close neighborhood of the predicted locus. Neither the addition of contact information matrices nor the introduction of the multi headed attention mechanism, both aimed to ease the processing of long-range interactions, yielded significant changes in overall performance. These findings, which build on top of previous work [8], have wide implications in the way we understand genomic organization and the extent to which it influences the transcriptional landscape, and invite us to rethink how new computational models can be designed to more suitably exploit such intrinsic properties of the genome.
We show how in silico perturbations of a given chromatin landscape can provide an added layer of explainability to the model’s predictions. These perturbative approaches exploit the rules in the input that the models have learned to explore the effects of unseen epigenetic variability but come with a number of challenges. On the one hand, perturbations of isolated input tracks can fail to mimic the expected biological behavior given the correlations established between the different marks, which define the chromatin states [33]. Chromatin-mark based models aimed to perform new synthetic/in-silico perturbations should either aim to replicate a complete desired chromatin state in a given locus of interest or successfully disentangle the roles of different marks. Sequence-based models might be less prone to display such correlational effects given that the one-hot encoding of nucleotides makes in-silico SNPs, insertions and deletions in-distribution alternative inputs for a given locus. However, the same type of correlational dynamics and co-appearance patterns might be learned between different parts of the sequence. In fact, Enformer has been shown to incorrectly predict the direction of the effects of certain genomic variants, in particular for non-monogenic traits that would imply the combined action of several loci [51, 52]. These dynamics could constitute a hint towards the observation of a similar correlational learning framework as the one we reported in our first perturbational simulations, in this case across the sequence.
This study focused on the prediction of EU-seq, an assay based on metabolic labeling of nascent transcripts in vivo. Subsequent research should extend the training of models to predict complementary readouts of nascent transcription in vitro, such as GRO-seq or PRO-seq [53]. Despite providing similar information, these nuclear run-on techniques are characterized by a higher resolution signal, ideal for precisely mapping polymerase densities and pausing sites. However, this higher resolution makes the signals inherently noisier. This not only complicates the quantification of nascent transcripts but also makes the signals more challenging to predict from the chromatin landscape and structure alone than the proxies used in this manuscript, namely RNA-seq and POLR2A binding profiles. Left to be explored are better ways to couple high resolution contact matrices with chromatin mark profiles, with graph attention networks already showing great promise [22]. Nevertheless, the high sequencing depth requirement on higher resolution chromatin conformation maps and the related costs compromise scalability for machine-learning tasks. Our findings would motivate the exploration of region capture techniques to map shorter regions at a more fine-grained level [21].
Further work should aim to move from a mostly correlational learning framework to a more causal one, which is of utmost importance to determine the directionality of the association between transcription and the presence of certain chromatin marks. Training chromatin-based models in a GPT [54] or BERT [55] inspired masking fashion would perhaps provide the opportunity to do so, disentangling the roles of different epigenetic marks while studying sequences from varied cell type backgrounds [56]. Finally, transcription factors and their binding profiles are a key missing piece of our models, with their addition as extra input tracks being the next logical step.
Conclusions
We present CLASTER, a deep convolutional neural network that can be used to build an implicit gene regulation model, agnostic to DNA sequence and without relying on the manual extraction of a number of curated features. CLASTER can understand the relations established between different epigenetic marks describing a given chromatin state, how these states are combined to create a chromatin landscape, and how a given landscape and its matching three-dimensional arrangement can be translated into nascent transcription levels. Given its implementation using the EIR framework, CLASTER can easily be extended to new datasets, data modalities and tasks. Therefore, we expect our work to serve as an inspiration and an easy entry-point to epigenomics modeling for a broad range of researchers with diverse computational resources and scientific backgrounds.
Methods
Processing of chromatin mark array data
Normalized enrichment levels for ATAC-seq, H3K4me3, H3K27ac and H3K27me3 were obtained at a 100 bp resolution from publicly available data in the NCBI GEO under accession numbers GSE146328 and GSE186349 for mESCs and GSE163043 for human erythroleukemia (K562) cells [30, 36, 57]. Each sample consisted of an array of shape (N_chrom marks_, N_bins_) = (4, 10,001) where each row was a different mark and each column a 100 bp bin with the mark’s averaged enrichment level (see Fig. 2f, g). Input samples were centered at the TSS of a reference, protein coding gene, and explored ~ 1Mbp of genomic context (Additional file 2: Table S2). Note that a given gene could appear in different samples but at different locations, if it was close enough to the central predicted gene. A detailed description of the genomic regions constituting our samples and matching targets can be found in Additional file 2: Table S2. Data distributions are illustrated in Additional file 1: Fig. S5 and 16 and a guide through data obtention and preprocessing steps can be found at the Github repository of the project https://github.com/RasmussenLab/CLASTER.
Processing of nascent RNA array data
Nascent RNA labeling with 5-ethynyl uridine and sequencing (EU-seq) profiles were obtained from publicly available sources (GSE146328 [30]). A thorough description of the data obtention protocol can be found in the corresponding paper [30]. In summary, cells were pulsed with 5-EU for 20 min, which labelled newly synthesized transcripts, before harvesting the cells and sequencing the labelled RNA. Since our aim was to estimate integrated nascent gene expression levels and to study the propagation and impact of in silico perturbations in as many genes as possible per sample, we prioritized the prediction of a wide genomic region over output resolution. Output profiles matching our input samples were first obtained at a high resolution of 20 bp and centered at the TSS of each reference gene (Additional file 2: Table S2). A gaussian kernel of sigma = 50 bins, was then applied to smooth the signals before averaging them in bins of 1 kbp (Additional file 1: Fig. S3-4). This procedure was aimed to remove high resolution variation in the EU-seq profiles that was not encoded in our inputs while preserving the main source of variation of the profiles at the kbp scale (Additional file 1: Fig. S4). This reduced drastically the number of targets, speeding up the computations of input–output attribution scores. Each target profile was finally described by 401 output bins at a kbp resolution centered at the TSS of the gene of interest, allowing genes close to the boundaries to have access to a wide input chromatin context. Training and test EU-seq value distributions matched each other until ca. 200 reads, with values above 400 being rare (Additional file 1: Fig. S5). Target profiles were further cropped to fit the target windows of the Enformer, Hyenadna-160k and Hyenadna-32k, detailed below.
Processing of RNA-seq and POLR2A data
ChIP-seq binding profiles for POLR2A, the largest RNA polymerase II subunit, and RNA-seq were obtained for K562 cells from GSE91426 and GSE78558 respectively [58]. RNA-seq reads from negative and positive strands were integrated into a single track. The preprocessing steps were the same as for EU-seq targets. We first defined a filter to set potential invalid values, e.g. NaN or negative, to zero. Output profiles were obtained at a high resolution of 20 bp, we removed high frequency variation by applying a gaussian filter of sigma = 50 bins and finally averaged the outputs in bins of 1 kbp (Additional file 2: Table S2). Each target sample was described by 401 output values centered at the TSS of a reference protein coding gene, which allowed genes close to the boundaries to have access to a wide input chromatin context. The maximum signal was finally clipped to 1000 reads/bin to constrain the dynamic range of the outputs (Additional file 1: Fig. S16).
Processing of Micro-C contact maps
Micro-C matrices describing chromatin contact frequencies in mESCs were obtained at a resolution of 1600 bp per bin from NCBI GEO under accession number GSE130275 [31]. Lower resolution mcool representations started to lose the nuances in structural detail provided by Micro-C, but higher resolution matrices became too sparse. Bin values had a range of 7 orders of magnitude after balancing, with a minimum of 6,807 · 10^–7^ and a maximum of 2,001 in our data (Additional file 1: Fig. S6). NaN values due to balancing and 0 count bins were then set to 10^–14^ → log_10_ = −14. This arbitrarily small value was chosen to be the same distance apart from the minimum actual contact value as the minimum to maximum contact range. Imputing by the minimum 10^–7^ could be understood as an averaged homogeneous background contact, which we tried to avoid by giving those bins a clearly distinct behavior. Micro-C matrices were originally treated as 2D arrays of shape (625, 625) containing log_10_ transformed bin counts. A list of regions for which Micro-C matching maps could not be obtained is found in Additional file 2: Table S2.
Rotations and cropping of Micro-C maps
Contact matrices displayed two properties: 1) they were symmetrical and 2) the number of reads decreased when exploring regions far from the diagonal. Treating them as images, i.e. allowing for the convolutions to scroll in both x and y directions, led to an under usage of the model’s capabilities given the redundancy found in both triangles and the lack of information provided by the regions close to the corners far from the diagonal. In order to provide the contact information in a more appropriate manner for a model such as CLASTER, Micro-C matrices were rotated 45 degrees and cropped. This procedure allowed us to 1) reduce the number of inputs, accelerating the computations and 2) define deeper 1D kernels exploring all distances at once. These kernels were then only allowed to scroll sideways, in the original diagonal direction. Given the symmetry of the matrices, we kept only the upper triangle. The rotated triangle was cropped to ca. 0.207 times the original length to create a rectangular matrix of shape (129, 626). Distances and scales in the new matrix are illustrated in Additional file 1: Fig. S22 and can be expressed as follows. Expressing \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a^{\prime}$$\end{document} in terms of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$L$$\end{document} , we have:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$2a + L =\sqrt{{L}^{2}+ {L}^{2}}=\sqrt{2}L$$\end{document}We get \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a = \left(\frac{\sqrt{2}-1}{2}\right)L \simeq 0.2071067812 L$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a^{\prime}=a/\sqrt{2}$$\end{document} . After rotating and cropping the matrices, we have 626 bins for a smaller window of observation. The width of the new window was then \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$L/\sqrt{2}\simeq 0.70710678118 L$$\end{document} where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$L = 1 Mbp$$\end{document} . Linear bin densities in the original and transformed Micro-C matrices were:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\lambda }_{original}= \frac{625\ bins}{1000\ kbp}=0.625\ {bin}/{kbp}, {\lambda }_{transformed}= \frac{626\ bins}{707.106\ kbp}\simeq 0.8853\ bin/kbp$$\end{document}The interaction distance \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${d}_{1D}$$\end{document} , i.e. the real separation in 1D sequence between two interacting elements, was implicitly encoded in the new y coordinate \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${y}_{new}$$\end{document} , which was the distance in the perpendicular direction to the diagonal. A point at a distance \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${y}_{new}$$\end{document} from the diagonal, found at a column bin \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${y}_{bins}$$\end{document} in the rotated and cropped matrices, connected the points in the diagonal at \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${-x}_{bins}=-|{y}_{bins}|$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${x}_{bins}=|{y}_{bins}|$$\end{document} given that the scale was the same in both axes. This distance can be converted from new bins to kbp in 1D sequence using the diagonal bin density found before as \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${d}_{1D}= \frac{2\cdot {y}_{bins}}{{\lambda }_{transformed}}$$\end{document} .
Processing of promoter-capture Hi-C maps
Promoter-Capture HiC maps were obtained for mESCs from publicly available data deposited in GSM2753060 by [32]. The FASTQ files were processed using the distiller pipeline, with the original parameters for mapping, parsing, deduplication control, and binning as the yml in the repository (https://github.com/mirnylab/distiller-nf). Reads were mapped to the mouse reference assembly (mm10) using BWA-MEM (Burrows-Wheeler Alignment for 70 bp to 1 Mbp reads) with flags for high accuracy mapping of paired end reads (-SP). A chunk size of 30 M reads was applied for optimized distribution computing. Paired-end reads were parsed into read pairs using the pairtools package (‘make_pairsam’, ‘drop_readid’, ‘drop_seq’), with additional options to store MAPQ scores (–add-columns mapq) and mask complex walks in long reads (–walks-policy mask). Reads considered PCR and optical duplicates, read pairs with a maximum allowed mismatch of 1 bp between pairs on both sides, were removed. The processed read pairs were binned into multi-resolution contact maps (multi-cool/mcool files) using the cooler package, with resolutions ranging from 10 Mb to 100 bp (500 bp, 1 kb, 2 kb, 5 kb, 10 kb, 50 kb, 100 kb, 250 kb, 500 kb, 1 Mb). The contact matrix was normalized using the iterative correction procedure126, and pairs with low mapping quality, with MADmax (maximum allowed median absolute deviation) threshold of 30, were excluded during binning. The output files were balanced to ensure proper normalization of the contact matrix.
Finally, contact maps matching our chromatin landscape samples, i.e. exploring 1 Mbp at a 5 kbp resolution and centered at the TSS of the reference genes, were introduced in their original squared form to the structural branch of CLASTER. We used the same imputation scheme as for Micro-C data, imputing by −14 after balancing and log10 transforming bin counts. This was done to give a clearly distinct behavior to imputed bins instead of that of a “low contact” area.
Reference genome, gene annotations and enhancer annotations
The mouse reference genome and protein coding gene annotations for both mouse and human were downloaded from the GENCODE (mouse; GRCm38 release 25, human; GRCh37-hg19) [59]. Proximal and distal Enhancer-Like Signatures (pELS, dELS) in mice were obtained from Supplementary Table 11 in [34]. Human enhancer signatures targeted in CRISPR KO experiments by Nasser et al. [37], Gasperini et al. [38] and Schraivogel et al. [39] were obtained from the harmonized dataset presented in Gschwind et al. [13]. Enhancer annotations were lifted to hg19 using the LiftOver (https://genome.ucsc.edu/cgi-bin/hgLiftOver) tool provided by the UCSC Genome Browser [60] to match the assembly used for the input chromatin marks. Superenhancer annotations for mESCs were obtained from dbSUPER [61].
In silico enhancer ablation perturbations
In the first perturbational scenario (P1), the loss of enhancer activity was simulated by setting to zero reads the levels of the enhancer mark H3K27ac alone. In the second perturbational scenario (P2), a heterochromatin state substitution was performed for all bins inside the enhancer boundaries by setting the levels of all marks to 0 reads except from H3K27me3. This arbitrary silenced-like state was aimed to remove the contributions of accessibility, background promoter and enhancer activities while keeping a plausible level of H3K27me3 in the corresponding chromatin environment. We then followed an enhancer-centric approach, i.e. testing the effect of a single enhancer on the N-genes falling inside the predicted window. To do that, we created new input samples centered at the enhancer signatures in mice and also the enhancers targeted in the CRISPR assays for human K562. Matching outputs were predicted for both for the baseline chromatin landscape inputs and also when silencing the central enhancer (P2), with the simplest CLASTER models using only the chromatin branch. The same architecture was trained separately to predict mouse EU-seq, human K562 RNA-seq and POLR2A ChIP-seq. Only active enhancers were kept for the analyses, where an active enhancer was defined as an annotated enhancer region with an average enrichment of H3K27ac above a predefined threshold, set to be 10 reads. We then integrated the area inside gene boundaries for EU-seq and RNA-seq using the Simpson algorithm for perturbed and control predictions (Additional file 1: Fig. S17b). The domain of integration for POLR2A predictions ranged between −2 kbp and 3 kbp from the TSS to fully capture promoter proximal binding and pausing peaks (Additional file 1: Fig. S17a), with similar results being obtained when integrating from −1 kbp to 2 kbp. No thresholds were imposed on gene expression or POLR2A ChIP-seq levels. Absolute area changes ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\Delta }_{ij}$$\end{document} ) were quantified for all genes \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$j$$\end{document} fully comprised inside the prediction window when in silico silencing the central enhancer \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$i$$\end{document} .
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\Delta }_{ij} = |{ A}_{{j}_{baseline}}-{A}_{{j}_{i perturbed}} |$$\end{document}Here, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${A}_{{j}_{baseline}}$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${A}_{{j}_{i perturbed}}$$\end{document} described the area under the curve inside the respective boundaries. The obtained results are detailed in Additional file 2: Table S5-7.
Enhancer-promoter pairing using CLASTER
Absolute area differences ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\Delta }_{ij}$$\end{document} ) for POLR2A and RNA-seq were used as continuous scores reflecting the impact of an enhancer on a given gene. In addition, we computed ratio-to-max scores, i.e. dividing the area difference of each gene by the largest difference measured in the same predicted window ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\Delta }_{ij}$$\end{document} / \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\Delta }_{i{j}_{max}}$$\end{document} ). These scores introduced the notion of “primary target” of an enhancer in its native genomic environment, and performed better than \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\Delta }_{ij}$$\end{document} . We also added the negative distance between enhancer and TSS (closer means higher interaction chances) as a baseline predictor of enhancer-promoter pairs. ROC and PR curves exploring all possible thresholds for the scores were then plotted using the enhancer CRISPR knock-out results harmonized by Gschwind et al. [13] as ground truths.
Discrete EP models were finally obtained by assigning each enhancer to its primary target according to POLR2A and RNA predictions and also assigning the closest gene to the enhancer. These boolean predictors were used to build confusion matrices (Additional file 1: Fig. S18).
Regulatory logic perturbations
We focused on the gene Akirin2 since 1) it has a large intergenic region upstream and 2) ATAC-seq and H3K4me3 are extended towards its gene body. To simulate the insertion of a new gene 100 kbp upstream, we introduced the chromatin state ranging between −10 kbp and + 10 kbp from the TSS. The activities of the four marks defining the chromatin state in the same range were divided and multiplied by two to simulate a weaker or stronger promoter respectively. The same symmetric chromatin domain was flipped across the sequence axis to simulate the promoter-proximal profiles of a gene that would be encoded on the negative strand. Finally, we copied a 40 kbp chromatin region containing the mESC superenhancer assigned to KLF4 and inserted it starting 60 kbp upstream of Akirin2. Since the gene had a strong promoter, we used the activity profile of the weak promoter perturbation. This was motivated by experimental results, as greater expression changes in enhancer-dependent genes are expected for genes with weak promoters [8], but also empirically, given that the contribution of H3K4me3 would mask that of H3K27ac (See chromatin branch attribution plots in Fig. 3b, c).
Scoring the contributions of each relative input location to the output
To score the contributions of a given input position towards the expression levels in a target locus, we used the widespread integrated gradients axiomatic attribution method [41] as implemented in the Captum python library. Integrated gradients were computed as the path integral of the gradients along the straight line path connecting a predefined baseline signal \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$x$$\end{document} ’ and a given input \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$x$$\end{document} . The baseline background signal was obtained by averaging 256 samples in each modality. The authors define the integrated gradient along the \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${i}_{th}$$\end{document} dimension as:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$IntegratedGrad{s}_{i}(x) ::= ({x}_{i }-{{x}_{i}}^{\prime}) \times {\int }_{\alpha =0}^{1}\frac{\partial F(x^{\prime} + \alpha \times (x-x^{\prime}))}{\partial {x}_{i }}d\alpha$$\end{document}As stated before, the chromatin landscape inputs were described by a matrix of shape (N_chrom marks_, N_input bins_) = (4, 10,001) exploring 1 Mbp in bins of 100 bp. Chromatin structure was given by a matrix of shape (129, 626) describing the same region at a resolution of 1.6 kbp. EU-seq targets were described by a vector of length N_output bins_ = 401 exploring from −200 kbp to 200 kbp relative to the TSS of a gene of interest in bins of 1 kbp. Signed and absolute attribution scores were then obtained for every bin position in the input arrays towards the prediction of every target position (Additional file 1: Fig. S23). To obtain averages over predicted points, attribution maps for each target node were then rolled in steps of 10 bins = 1 kbp to locate the predicted target to the center of the map. High frequency noise arising from the translations of the maps was removed by adding a small uncertainty in the number of bins to roll for the chromatin landscape, adding eps number of bins with −10 bins < eps < + 10 bins. A similar smoothing effect was achieved for structural attributions by applying a Gaussian filter (Additional file 1: Fig. S24). The maps were then symmetrically cropped to keep only the relative positions to the predicted point that had been measured in all predictions, with measurements centered at −200 kbp and 200 kbp setting the limits (Additional file 1: Fig. S23). The same principles were applied to prom-CHiC data, but the original squared matrices exploring 1 Mbp in bins of 5 kbp were kept. A Gaussian filter of sigma = 1 bin was applied to remove noise arising from the matrix translations across the diagonal axis (Additional file 1: Fig. S25-26).
CLASTER
The architecture of CLASTER is illustrated in Additional file 1: Fig. S1 and further detailed in Additional file 2: Table S1. In order to translate the input to the corresponding output, features were extracted separately from both branches of the model, namely the chromatin landscape branch and the chromatin structure branch but using the same principles. These include a set of layers implementing dilated convolutions, residual connections and optional attention heads. Input dimensionality reduction was achieved by scrolling the convolutional kernels in strides of 2. The extracted features at different scales were integrated at a later stage in the network, the fusion module. Further dimensionality reduction and mapping to the output nascent RNA levels was performed with a final set of densely connected layers (MLP). The MLP ended in a hidden vector representation, and finally each node was linearly connected to every single prediction point, i.e. 401 bins at 1 kbp resolution. A final non-trainable ReLU was applied to cut negative values. Predictions before ReLU clipping did not deviate much from zero since that was punished by the loss. Performance metrics were slightly better with ReLU clipping, but that applied to training and test sets equally (Additional file 2: Table S3). The linear regression approach yielded better results in terms of error and correlation with ground truths than training with positive bound activation functions such as Softplus, which led the models to predict the average signal expected at each position as a baseline instead of zero (Additional file 1: Fig. S2). Configuration files to build and run the models using the EIR framework [29] in both training and test settings are available at https://github.com/RasmussenLab/CLASTER.
Training
The network was trained in batches of 64 samples and a learning rate of 1e-4. We used the AdamW optimizer with default parameters [62]. Regularization measures included the use of high dropout values in the MLP layers, setting a given stochastic depth (probability to remove residual block) and the use of SmoothL1 as the loss function. Flipped samples were also added to the training set for data augmentation purposes but were not used for the test set analyses. Chr17 samples were used as a hold out validation, and Chr4 was used as a hold out test. A detailed description of the genes of interest that were used to create our samples can be found in Additional file 2: Table S2. Samples for which we could not obtain Micro-C matching maps were removed from the training, validation or test sets in all settings, and can be found in the same Additional file 2: Table S2.
Benchmarks
Sequence embeddings corresponding to the central 160 kbp of the genomic regions of study were obtained from the backbone of HyenaDNA-160k [17] with pretrained, frozen weights. Code to build HyenaDNA-160k was obtained from the publicly accessible google colab [63] and pretrained weights were obtained from [64]. The input sequences when using HyenaDNA-160k ranged in this case from −80 kbp to 80 kbp after the TSS of the central gene in the observation window. We then trained a simple model or head on top of the frozen embeddings to predict our EU-seq signals at the same resolution as we did with CLASTER (Additional file 1: Fig. S7-8). The head performed an average pooling over the sequence axis to reduce sequence length at a 128 bp resolution. To ease the predictions, a first set of convolutions of kernel 9 and dilation 2 were applied to the resulting embeddings. This enabled the head to identify patterns at a scale of ca. 2 kbp. These patterns were then mapped to the targets using a dense layer.
We tried to perform the regression of the target EU-Seq profiles using the pretrained embeddings from the HyenaDNA-160k, but we encountered a number of challenges. First, in the original manuscript the authors performed mainly classification tasks at a sequence level, but simultaneous regressions of multiple outputs were not explored. Second, HyenaDNA was pretrained on human and not mouse genomic sequences, although we would not expect this to be a core issue. We believe that the pretrained embeddings obtained by next token prediction did not encode enough functional information to perform the EU-Seq predictions, since further fine-tuning of the model was required for our downstream task. This was importantly not the case with Enformer-based embeddings. This fine-tuning requirement might not be unique to HyenaDNA. Finally, Hyena’s design was aimed to keep nucleotide resolution, but this would not be necessary to predict averages of read counts at a coarse, kbp resolution. Despite the low parameter count and high efficiency of the Hyena operator, which reduced drastically the number of trainable parameters, the network’s design yielded a high number of inner sequence representations, the activations at each filter in each layer, that kept sequence length (160,000 columns) while adding an embedding dimension (e.g., 256 rows). Each single sequence representation as a matrix of float32 values required:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$160.000 \times 256\ values \times \frac{32\ bits}{1\ value}\times \frac{1\ byte}{8\ bits}\times \frac{1\ Gb}{{1024}^{3}\ bytes} \sim 0.1526\ Gb$$\end{document}With many of them being stored already in the forward pass. This imposed a significant challenge, rapidly increasing memory and computing time requirements, which constrained us to fine-tune Hyena-160k in batches of only 2 sequences in an NVIDIA A100 GPU with 80 GB of RAM. This motivated us to use the smaller Hyena-32k model, for which we could load the pretrained weights and fine-tune the pretrained backbone together with the added head in batches of 16 samples in a reasonable period of time. Targets were cropped to match the new context length to the range −15 kbp to + 15 kbp.
The same procedure was repeated for the Enformer architecture [4]. Sequence embeddings corresponding to the central 160 kbp of the genomic regions of study were obtained from the backbone of Enformer [4] with pretrained, frozen weights. Code to build the Enformer in Pytorch and matching pretrained weights were obtained from [65]. To match the previous scenario when using the Enformer, we used the same 160 kbp sequences as in Hyena and symmetrically prepended and appended N nucleotides as proposed in [11], which acted as masks, to provide the same available information in both settings. The embeddings of the Enformer, of shape (1, 896, 3072) corresponded to \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$896 \times 128\ bp = 114688\ bp \sim 115\ kbp$$\end{document} . The matching targets were therefore cropped to keep the range between −57 kbp to + 57 kbp from the central gene’s TSS to match the Enformer’s prediction window.
Supplementary Information
Additional file 1. Supplementary Figures S1-S27.Additional file 2. Supplementary Tables S1-S8.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Dalla-Torre H, Gonzalez L, Mendoza-Revilla J, Carranza NL, Grzywaczewski AH, Oteri F, et al. The nucleotide transformer: building and evaluating robust foundation models for human genomics. bio Rxiv. 2023.10.1038/s 41592-024-02523-z PMC 1181077839609566 · doi ↗ · pubmed ↗
- 2Linder J, Srivastava D, Yuan H, Agarwal V, Kelley DR. Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. bio Rxiv. 2023. p. 2023.08.30.555582. Available from: https://www.biorxiv.org/content/biorxiv/early/2023/09/01/2023.08.30.555582. Cited 2023 Nov 2.10.1038/s 41588-024-02053-6PMC 1198535239779956 · doi ↗ · pubmed ↗
- 3Nguyen E, Poli M, Faizi M, Thomas A, Birch-Sykes C, Wornow M, et al. Hyena DNA: long-range genomic sequence modeling at single nucleotide resolution. ar Xiv. 2023. Available from: http://arxiv.org/abs/2306.15794.
- 4Nguyen E, Poli M, Durrant MG, Thomas AW, Kang B, Sullivan J, et al. Sequence modeling and design from molecular to genome scale with Evo. bio Rxiv. 2024. p. 2024.02.27.582234. Available from: https://www.biorxiv.org/content/10.1101/2024.02.27.582234 v 2. Cited 2024 Mar 11.10.1126/science.ado 9336 PMC 1205757039541441 · doi ↗ · pubmed ↗
- 5Brixi G, Durrant MG, Ku J, Poli M, Brockman G, Chang D, et al. Genome modeling and design across all domains of life with Evo 2. bio Rxiv. 2025. 10.1101/2025.02.18.638918.10.1038/s 41586-026-10176-5PMC 1312849141781614 · doi ↗ · pubmed ↗
- 6Sundararajan M, Taly A, Yan Q. Axiomatic attribution for deep networks. ar Xiv. 2017. Available from: http://arxiv.org/abs/1703.01365.
- 7Michaud EJ, Liu Z, Girit U, Tegmark M. The quantization model of neural scaling. In: Thirty-seventh Conference on Neural Information Processing Systems. 2023. Available from: https://openreview.net/forum?id=3tb Tw 2ga 8K.
- 8Sasse A, Ng B, Spiro AE, Tasaki S, Bennett DA, Gaiteri C, et al. Benchmarking of deep neural networks for predicting personal gene expression from DNA sequence highlights shortcomings. bio Rxiv. 2023. 10.1101/2023.03.16.532969.10.1038/s 41588-023-01524-638036778 · doi ↗ · pubmed ↗
