Interpretable Causal Representation Learning for Biological Data in the Pathway Space
Jesus de la Fuente, Robert Lehmann, Carlos Ruiz-Arenas, Jan Voges, Irene Marin-Goñi, Xabier Martinez-de-Morentin, David Gomez-Cabrero, Idoia Ochoa, Jesper Tegner, Vincenzo Lagani, and Mikel Hernaez

TL;DR
This paper introduces SENA-discrepancy-VAE, a causal representation learning model that produces interpretable latent factors aligned with biological processes, enabling better understanding of gene and drug perturbations.
Contribution
The paper presents a novel CRL model that aligns latent factors with biological processes, improving interpretability without sacrificing predictive accuracy.
Findings
Achieves comparable predictive performance to non-interpretable models.
Infers biologically meaningful causal latent factors.
Provides an efficient encoder for biological process activity levels.
Abstract
Predicting the impact of genomic and drug perturbations in cellular function is crucial for understanding gene functions and drug effects, ultimately leading to improved therapies. To this end, Causal Representation Learning (CRL) constitutes one of the most promising approaches, as it aims to identify the latent factors that causally govern biological systems, thus facilitating the prediction of the effect of unseen perturbations. Yet, current CRL methods fail in reconciling their principled latent representations with known biological processes, leading to models that are not interpretable. To address this major issue, we present SENA-discrepancy-VAE, a model based on the recently proposed CRL method discrepancy-VAE, that produces representations where each latent factor can be interpreted as the (linear) combination of the activity of a (learned) set of biological processes. To this…
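The abstract's central idea, latent factors that are linear combinations of biological process (BP) activities, can be illustrated with a toy sketch. This is not the authors' implementation: the gene-to-BP mask, the function names, the toy data, and the `lam` parameter (a hypothetical leakage weight for genes outside a BP, loosely motivated by the λ hyperparameter a reviewer mentions) are all assumptions, and the VAE machinery, nonlinearities, and causal layer are omitted.

```python
# Toy sketch (assumed design, not the paper's code): a pathway-masked
# linear encoder followed by a linear mixing of pathway activities.

def build_mask(genes, pathways, lam=0.0):
    """mask[i][j] is 1.0 if genes[i] is annotated to pathway j, else lam."""
    return [[1.0 if gene in members else lam for members in pathways.values()]
            for gene in genes]

def pathway_activity(expression, weights, mask):
    """Masked linear layer: activity_j = sum_i x_i * W_ij * M_ij."""
    n_genes, n_pathways = len(mask), len(mask[0])
    return [sum(expression[i] * weights[i][j] * mask[i][j] for i in range(n_genes))
            for j in range(n_pathways)]

def latent_factors(activity, mixing):
    """Each latent factor is a linear combination of pathway activities."""
    n_pathways, n_latent = len(mixing), len(mixing[0])
    return [sum(activity[j] * mixing[j][k] for j in range(n_pathways))
            for k in range(n_latent)]

# Hypothetical toy data: three genes, two biological processes, one latent factor.
genes = ["g1", "g2", "g3"]
pathways = {"bp_a": {"g1", "g2"}, "bp_b": {"g3"}}
expression = [2.0, 1.0, 4.0]
weights = [[0.5, 0.5], [1.0, 1.0], [0.25, 0.25]]  # learned in a real model
mixing = [[1.0], [2.0]]                            # pathway -> latent weights

mask = build_mask(genes, pathways)                 # hard mask, lam = 0
activity = pathway_activity(expression, weights, mask)
z = latent_factors(activity, mixing)
print(activity, z)
```

With `lam=0.0`, each activity depends only on the genes annotated to that BP, which is what makes it readable as a BP activity level. A nonzero `lam` lets other genes contribute with a reduced weight, trading some of that interpretability for capacity.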
Peer Reviews
Decision: ICLR 2025 Poster
- Clear technical contribution that bridges causal representation learning with biological interpretability while maintaining theoretical guarantees
- The paper contributes to causal representation learning for Perturb-seq data by introducing biological interpretability through pathway information, while maintaining the theoretical guarantees of discrepancy-VAE.
- Well written and clear presentation of the method and results.
- Thorough ablation studies
- Demonstrates interpretability of latent
- Experimental validation is limited to one dataset and no baselines other than their own ablations and discrepancy-VAE. The paper would benefit from comparisons to at least one of the other listed related works.
- No comparison with simpler approaches like post-hoc interpretation of standard discrepancy-VAE latent factors.
- While the link between latent factors and BPs is investigated, the quality of the discovered causal graph is not.
- Given that the latent factors group a large number of BP
- The paper is well written with strong motivations behind using CRL techniques for biological applications.
- The metrics proposed (differential activation, Hits@N) seem to be robust indicators of perturbation effects on BPs and downstream effects. I believe these evaluation metrics are one of the key interesting contributions of this work.
- The empirical evaluation is exhaustive and illustrates some interesting observations, especially the representational capacity of the VAE-based SENA metho
- Although the application in gene regulatory networks is quite interesting, this work seems to be more of an evaluation study of the discrepancy-VAE framework proposed by Zhang et al. I do not see much of an added contribution beyond the original paper besides highlighting the application.
- The difference in performance between the SENA variant and the original discrepancy-VAE seems to be quite marginal in terms of representation in the double-perturbation scenario. For instance, in Table 2, th
* Clarity: The paper is well written and easy to follow.
* Novel Technical Contribution: The paper successfully extends causal representation learning to incorporate domain knowledge while preserving theoretical guarantees. The SENA-δ encoder architecture is a clever solution to balance interpretability and performance.
* Practical Impact: The work addresses a significant gap in current causal representation learning methods for biological data, where interpretability is crucial for scientific
* Limited Biological Validation: While the authors show statistical associations between perturbations and biological processes, there could be more validation using external biological knowledge or experimental validation of the discovered causal relationships.
* Hyperparameter Sensitivity: The model introduces an additional hyperparameter λ that significantly impacts performance. While ablation studies are provided, more guidance on selecting this parameter would be valuable (this is importan
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Machine Learning in Bioinformatics · Bayesian Modeling and Causal Inference
Methods: Sparse Evolutionary Training
