Interpretable Causal Representation Learning for Biological Data in the Pathway Space

Jesus de la Fuente; Robert Lehmann; Carlos Ruiz-Arenas; Jan Voges; Irene Marin-Goñi; Xabier Martinez-de-Morentin; David Gomez-Cabrero; Idoia Ochoa; Jesper Tegner; Vincenzo Lagani; and Mikel Hernaez

arXiv:2506.12439 · cs.LG · June 17, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces SENA-discrepancy-VAE, a causal representation learning model that produces interpretable latent factors aligned with biological processes, enabling better understanding of gene and drug perturbations.

Contribution

The paper presents a novel CRL model that aligns latent factors with biological processes, improving interpretability without sacrificing predictive accuracy.

Findings

01. Achieves comparable predictive performance to non-interpretable models.

02. Infers biologically meaningful causal latent factors.

03. Provides an efficient encoder for biological process activity levels.

Abstract

Predicting the impact of genomic and drug perturbations on cellular function is crucial for understanding gene functions and drug effects, ultimately leading to improved therapies. To this end, Causal Representation Learning (CRL) constitutes one of the most promising approaches, as it aims to identify the latent factors that causally govern biological systems, thus facilitating the prediction of the effect of unseen perturbations. Yet, current CRL methods fail to reconcile their principled latent representations with known biological processes, leading to models that are not interpretable. To address this major issue, we present SENA-discrepancy-VAE, a model based on the recently proposed CRL method discrepancy-VAE, that produces representations where each latent factor can be interpreted as the (linear) combination of the activity of a (learned) set of biological processes. To this…
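The abstract describes latent factors that can be read as a (linear) combination of the activity of a set of biological processes (BPs). The following is a minimal sketch of that idea only, not the authors' implementation: the binary gene-to-BP membership mask, the tanh nonlinearity, the function and weight names, and the toy dimensions are all assumptions made for illustration.

```python
import numpy as np


def pathway_masked_encoder(x, w_gene_bp, mask, w_bp_latent):
    """Hypothetical two-stage encoder (illustrative, not the paper's code).

    x           : (cells, genes) expression matrix
    w_gene_bp   : (genes, bps) weights from genes to BP activities
    mask        : (genes, bps) binary gene-to-BP membership (assumed given)
    w_bp_latent : (bps, latents) dense weights from BPs to latent factors
    """
    # Stage 1: each BP activity depends only on its member genes,
    # because non-member weights are zeroed out by the mask.
    bp_activity = np.tanh(x @ (w_gene_bp * mask))
    # Stage 2: each latent factor is a linear combination of BP activities.
    z = bp_activity @ w_bp_latent
    return bp_activity, z


rng = np.random.default_rng(0)
n_cells, n_genes, n_bps, n_latent = 4, 6, 3, 2

# Toy membership: genes 0-1 belong to BP 0, genes 2-3 to BP 1, genes 4-5 to BP 2.
mask = np.array(
    [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]],
    dtype=float,
)
x = rng.normal(size=(n_cells, n_genes))
w_gene_bp = rng.normal(size=(n_genes, n_bps))
w_bp_latent = rng.normal(size=(n_bps, n_latent))

bp_activity, z = pathway_masked_encoder(x, w_gene_bp, mask, w_bp_latent)

# Shifting gene 0 can only move BP 0, since gene 0 is a member of BP 0 alone.
x_shifted = x.copy()
x_shifted[:, 0] += 1.0
bp_shifted, z_shifted = pathway_masked_encoder(x_shifted, w_gene_bp, mask, w_bp_latent)
```

In this toy setup the rows of `w_bp_latent` can be inspected directly to see which BPs contribute to each latent factor, which is the kind of interpretability the abstract refers to. How the actual model builds and trains its encoder may differ.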

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- Clear technical contribution that bridges causal representation learning with biological interpretability while maintaining theoretical guarantees
- The paper contributes to causal representation learning for Perturb-seq data by introducing biological interpretability through pathway information, while maintaining the theoretical guarantees of discrepancy-VAE.
- Well written and clear presentation of the method and results.
- Thorough ablation studies
- Demonstrates interpretability of latent…

Weaknesses

- Experimental validation is limited to one dataset and no baselines other than their own ablations and discrepancy-VAE. The paper would benefit from comparisons to at least one of the other listed related works.
- No comparison with simpler approaches like post-hoc interpretation of standard discrepancy-VAE latent factors.
- While the link between latent factors and BPs is investigated, the quality of the discovered causal graph is not.
- Given that the latent factors group a large number of BP…

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The paper is well written with strong motivations behind using CRL techniques for biological applications.
- The metrics proposed (differential activation, Hits@N) seem to be robust indicators of perturbation effects on BPs and downstream effects. I believe these evaluation metrics are one of the key interesting contributions of this work.
- The empirical evaluation is exhaustive and illustrates some interesting observations, especially the representational capacity of the VAE-based SENA metho…

Weaknesses

- Although the application in gene regulatory networks is quite interesting, this work seems to be more of an evaluation study of the discrepancy-VAE framework proposed by Zhang et al. I do not see much of an added contribution beyond the original paper besides highlighting the application.
- The difference in performance between the SENA variant and the original discrepancy-VAE seems to be quite marginal in terms of representation in the double-perturbation scenario. For instance, in Table 2, th…

Reviewer 03 · Rating 6 · Confidence 3

Strengths

* Clarity: The paper is well written and easy to follow.
* Novel Technical Contribution: The paper successfully extends causal representation learning to incorporate domain knowledge while preserving theoretical guarantees. The SENA-δ encoder architecture is a clever solution to balance interpretability and performance.
* Practical Impact: The work addresses a significant gap in current causal representation learning methods for biological data, where interpretability is crucial for scientific…

Weaknesses

* Limited Biological Validation: While the authors show statistical associations between perturbations and biological processes, there could be more validation using external biological knowledge or experimental validation of the discovered causal relationships.
* Hyperparameter Sensitivity: The model introduces an additional hyperparameter λ that significantly impacts performance. While ablation studies are provided, more guidance on selecting this parameter would be valuable (this is importan…

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Machine Learning in Bioinformatics · Bayesian Modeling and Causal Inference

Methods: Sparse Evolutionary Training