TL;DR
Seq2Exp is a novel neural network model that improves gene expression prediction by explicitly discovering regulatory elements, capturing causal relationships, and outperforming existing methods in accuracy and region discovery.
Contribution
The paper introduces Seq2Exp, a new model that explicitly discovers regulatory elements and models causal relationships for gene expression prediction.
Findings
Seq2Exp outperforms existing baselines in prediction accuracy.
Seq2Exp effectively discovers influential regulatory regions.
The approach captures causal relationships between DNA, epigenomic signals, and gene expression.
Abstract
We consider the problem of predicting gene expressions from DNA sequences. A key challenge of this task is to find the regulatory elements that control gene expressions. Here, we introduce Seq2Exp, a Sequence to Expression network explicitly designed to discover and extract regulatory elements that drive target gene expression, enhancing the accuracy of the gene expression prediction. Our approach captures the causal relationship between epigenomic signals, DNA sequences and their associated regulatory elements. Specifically, we propose to decompose the epigenomic signals and the DNA sequence conditioned on the causal active regulatory elements, and apply an information bottleneck with the Beta distribution to combine their effects while filtering out non-causal components. Our experiments demonstrate that Seq2Exp outperforms existing baselines in gene expression prediction tasks and…
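As a rough illustration of the information-bottleneck idea in the abstract, the sketch below draws a per-position soft mask from Beta distributions and penalizes its divergence from a sparse Beta prior, so that most positions are pushed toward zero. It is not the authors' implementation. The function names (`beta_kl`, `sample_mask`, `bottleneck_penalty`), the prior `Beta(1, 9)`, and the toy feature values are all assumptions made for illustration. The closed-form KL divergence between two Beta distributions is standard.

```python
import math
import random


def digamma(x: float) -> float:
    """Approximate the digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:  # shift x upward so the asymptotic series is accurate
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + math.log(x) - 0.5 * inv - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def log_beta_fn(a: float, b: float) -> float:
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_kl(a1: float, b1: float, a2: float, b2: float) -> float:
    """Closed-form KL( Beta(a1, b1) || Beta(a2, b2) )."""
    return (
        log_beta_fn(a2, b2) - log_beta_fn(a1, b1)
        + (a1 - a2) * digamma(a1)
        + (b1 - b2) * digamma(b1)
        + (a2 - a1 + b2 - b1) * digamma(a1 + b1)
    )


def sample_mask(alphas, betas, rng):
    """Draw one soft mask value in [0, 1] per sequence position."""
    return [rng.betavariate(a, b) for a, b in zip(alphas, betas)]


def bottleneck_penalty(alphas, betas, prior=(1.0, 9.0)):
    """Mean KL from each position's Beta to a sparse prior (hypothetical choice)."""
    return sum(beta_kl(a, b, *prior) for a, b in zip(alphas, betas)) / len(alphas)


if __name__ == "__main__":
    rng = random.Random(0)
    features = [0.2, 1.5, -0.3, 0.9]   # toy per-position features (made up)
    alphas = [1.0, 8.0, 1.0, 1.0]      # position 1 is treated as "regulatory"
    betas = [9.0, 2.0, 9.0, 9.0]
    mask = sample_mask(alphas, betas, rng)
    masked = [m * f for m, f in zip(mask, features)]
    print("masked features:", masked)
    print("bottleneck penalty:", bottleneck_penalty(alphas, betas))
```

In a trained model the Beta parameters would be predicted from the DNA sequence and epigenomic signals, and the penalty would be added to the prediction loss. Here they are fixed by hand, so only the position whose distribution departs from the prior contributes to the penalty.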
Peer Reviews
Decision·ICLR 2025 Oral
When I compare this paper with other benchmarks (except EPInformer) in the field and for the same problem definition - I would rate this paper as high in terms of originality & significance. Yes, this work shows that gene expression prediction is not a siloed process, and that there are causal relationships between DNA sequences, epigenomic data, regulatory elements, and surrounding causal / non-causal parts of the input data. Also, deep learning algorithms can learn from these relationships and
This follows from my comments on the strengths of the paper. After reading the EPInformer paper (released in early 2024), my rating reduces. Was this idea a new approach? I would respectfully say no. EPInformer highlights the same intent and experimental approach, and even comes very close in performance benchmarks and standard metrics. I am not sure this paper has added anything new for the ICLR community. To me as a reader, it has only reinforced the claims of EPInformer. I would have been happier if they
The paper showcases a successful application of the recently proposed Selective State Space models (specifically, MAMBA-based Caduceus) on relatively long sequences (200k). The model goes beyond most current approaches by combining sequence with epigenetic signaling in a single, jointly optimized architecture. The architecture is sound, and uses the information bottleneck principle to promote sparsity of the mask for selecting sequence regions, and then a straight-through estimator to allow differ
While the manuscript is mostly well written, it could use a more detailed explanation in several places, specifically related to: - gene expression Y: it is often just described as "target variable Y", or "$Y \in R$". From 5.1.4 we read that ultimately the model is applied to a 200kbp region around a specific target gene, and one can understand that Y is the expression of that gene; it would be helpful to provide this information clearly earlier in the paper. - model architecture: it also deferr
The main idea is well articulated and sensible. The paper is very clearly written. The related work section does a reasonable job of covering the recent, relevant literature. The model is well formulated and modeling choices are well motivated in the text. The empirical results are strong.
I think the sentence "Gene expression prediction is one of the fundamental tasks in bioinformatics." should be supported with a citation or two from some of the seminal works in this area, e.g., Nir Friedman's 2000 paper on predicting gene expression from promoter sequences. The task description should be expanded. It's not clear, from the description, how the inputs are registered to the output value Y. Is the idea that the length-L input window is centered on the TSS? To my mind, it would
