# SEMbap: Bow-free covariance search and data de-correlation

**Authors:** Mario Grassi, Barbara Tarantino

PMC · DOI: 10.1371/journal.pcbi.1012448 · PLOS Computational Biology · 2024-09-11

## TL;DR

This paper introduces SEMbap, a new method for identifying and correcting hidden confounding factors in gene expression data using structural equation models.

## Contribution

The novel contribution is a two-stage deconfounding procedure based on a Bow-free Acyclic Paths (BAP) search, integrated into the Structural Equation Model (SEM) framework and implemented as the SEMbap() function in the R package SEMgraph.

## Key findings

- The BAP search algorithm correctly identifies hidden confounding while controlling false positives.
- SEMbap outperforms existing methods in fitting and perturbation metrics on both simulated and real data.
- The method provides a low-dimensional representation of bow-free edge structures via graph Laplacian PCA.

## Abstract

Large-scale studies of gene expression are commonly influenced by biological and technical sources of expression variation, including batch effects, sample characteristics, and environmental impacts. Learning the causal relationships between observable variables may be challenging in the presence of unobserved confounders, and many high-dimensional regression techniques may perform worse under these conditions. Controlling for unobserved confounding variables is therefore essential, and many deconfounding methods have been suggested for application in a variety of situations. The main contribution of this article is a two-stage deconfounding procedure based on a Bow-free Acyclic Paths (BAP) search, developed within the framework of Structural Equation Models (SEM) and called SEMbap(). In the first stage, an exhaustive search of missing edges with significant covariance is performed via Shipley d-separation tests; in the second stage, either a Constrained Gaussian Graphical Model (CGGM) is fitted or a low-dimensional representation of the bow-free edge structure is obtained via Graph Laplacian Principal Component Analysis (gLPCA). We compare four popular deconfounding methods with the BAP search approach on simulated and observed expression data. In the simulated data, different structures of the hidden covariance matrix are replicated. Compared to existing methods, the BAP search algorithm correctly identifies hidden confounding whilst controlling the false positive rate and achieving good fitting and perturbation metrics.
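
The first stage described in the abstract can be illustrated with a minimal Python sketch. It is not the SEMgraph implementation: it assumes Gaussian data, approximates each d-separation test with a Fisher z-test on the partial correlation of a non-adjacent pair given the parents of both nodes, and applies no multiple-testing correction. All function names (`corr_matrix`, `partial_corr`, `candidate_bidirected_edges`) and the simulated data are hypothetical.

```python
import math
import random
from itertools import combinations


def corr_matrix(rows):
    """Pearson correlation matrix of a list of samples (one row per sample)."""
    n, p = len(rows), len(rows[0])
    cols = [[row[k] for row in rows] for k in range(p)]
    centred = [[v - sum(c) / n for v in c] for c in cols]
    norm = [math.sqrt(sum(v * v for v in c)) for c in centred]
    return [[sum(a * b for a, b in zip(centred[i], centred[j])) / (norm[i] * norm[j])
             for j in range(p)] for i in range(p)]


def partial_corr(R, i, j, S):
    """Partial correlation of i and j given the list S (recursive formula)."""
    if not S:
        return R[i][j]
    k, rest = S[0], S[1:]
    r_ij = partial_corr(R, i, j, rest)
    r_ik = partial_corr(R, i, k, rest)
    r_jk = partial_corr(R, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))


def fisher_z_pvalue(r, n, k):
    """Two-sided p-value for a partial correlation with k conditioning variables."""
    z = math.atanh(r) * math.sqrt(n - k - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def candidate_bidirected_edges(rows, parents, alpha=0.05):
    """Flag non-adjacent pairs whose covariance is not explained by the DAG.

    `parents` maps each column index to the list of its parents in the DAG.
    Assumption: each pair is tested given the union of both parent sets.
    """
    R, n = corr_matrix(rows), len(rows)
    hits = []
    for i, j in combinations(sorted(parents), 2):
        if i in parents[j] or j in parents[i]:
            continue  # adjacent in the DAG, so not a missing edge
        S = sorted(set(parents[i]) | set(parents[j]))
        if fisher_z_pvalue(partial_corr(R, i, j, S), n, len(S)) < alpha:
            hits.append((i, j))
    return hits


# Simulated example: a hidden variable h confounds columns 1 and 2.
rng = random.Random(0)
rows = []
for _ in range(500):
    h, x0 = rng.gauss(0, 1), rng.gauss(0, 1)
    rows.append([x0, 0.5 * x0 + h + rng.gauss(0, 1), 0.5 * x0 + h + rng.gauss(0, 1)])

parents = {0: [], 1: [0], 2: [0]}  # DAG: 0 -> 1, 0 -> 2
found = candidate_bidirected_edges(rows, parents)
print(found)  # the pair (1, 2) is the only non-adjacent pair and should be flagged
```

In this toy setting the flagged pair corresponds to a candidate bidirected edge; the second stage (CGGM fitting or gLPCA) would then use the set of such edges and is not sketched here.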

A directed acyclic graph (DAG), with variables at the vertices and direct causal connections at the edges, can be used to illustrate the causal structure of an SEM, but this does not always mean that all significant factors are considered. We examine a class of models that may include some hidden variables. Specifically, we consider that the graph represents a bow-free acyclic path diagram (BAP), where the directed edges signify direct causal effects, while the bidirected edges suggest hidden confounders. First, we provide a two-step deconfounding technique based on BAP search, which is incorporated into the SEM framework via the SEMbap() function implemented in the R package SEMgraph. Second, we offer a substantive evaluation of the most advanced deconfounding techniques using both synthetic and real data, as well as knowledge of a biological signaling pathway encoded in a DAG, in terms of (i) SEM fitting, (ii) system perturbation, and (iii) recovery performance metrics. The BAP search algorithm outperforms current techniques in accurately detecting hidden confounding, controlling the false positive rate, and producing good fitting and perturbation metrics.
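
To make the graph class concrete, the following Python sketch checks whether a mixed graph is a BAP. It relies on the standard definition, which this summary does not spell out: the directed part must be acyclic, and no pair of nodes may be joined by both a directed and a bidirected edge (a "bow"). The helper names (`find_bows`, `is_acyclic`, `is_bap`) and the node labels are hypothetical and are not part of SEMgraph.

```python
def find_bows(directed, bidirected):
    """Return node pairs joined by both a directed and a bidirected edge."""
    directed_pairs = {frozenset(edge) for edge in directed}
    return sorted(tuple(sorted(edge)) for edge in bidirected
                  if frozenset(edge) in directed_pairs)


def is_acyclic(nodes, directed):
    """Kahn's algorithm: True if the directed edges contain no cycle."""
    indegree = {v: 0 for v in nodes}
    children = {v: [] for v in nodes}
    for u, v in directed:
        children[u].append(v)
        indegree[v] += 1
    queue = [v for v in nodes if indegree[v] == 0]
    visited = 0
    while queue:
        u = queue.pop()
        visited += 1
        for v in children[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return visited == len(nodes)


def is_bap(nodes, directed, bidirected):
    """A bow-free acyclic path diagram: acyclic directed part and no bows."""
    return is_acyclic(nodes, directed) and not find_bows(directed, bidirected)


nodes = ["A", "B", "C"]
directed = [("A", "B"), ("B", "C")]          # direct causal effects
print(is_bap(nodes, directed, [("A", "C")]))  # A <-> C joins non-adjacent nodes
print(find_bows(directed, [("B", "A")]))      # A -> B together with A <-> B is a bow
```

Restricting bidirected edges to pairs that are non-adjacent in the DAG is what lets the search for hidden confounding focus on missing edges only.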

## Full-text entities

- **Genes:** BRCA1 (BRCA1 DNA repair associated) [NCBI Gene 672] {aka BRCAI, BRCC1, BROVCA1, FANCS, IRIS, PNCA4}, FZD4 (frizzled class receptor 4) [NCBI Gene 8322] {aka CD344, EVR1, FEVR, FZD4S, Fz-4, Fz4}, PIK3R2 (phosphoinositide-3-kinase regulatory subunit 2) [NCBI Gene 5296] {aka MPPH, MPPH1, P85B, p85, p85-BETA, p85beta}, ESR2 (estrogen receptor 2) [NCBI Gene 2100] {aka ER-BETA, ESR-BETA, ESRB, ESTRB, Erb, NR3A2}
- **Diseases:** FA (MESH:D005171), NULL (MESH:C564833), SEM (MESH:D004195), cancer (MESH:D009369), Breast Cancer (MESH:D001943), ALS (MESH:D000690), LVs (MESH:D000085343)
- **Chemicals:** BAP (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11419354/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11419354/full.md

## References

56 references — full list in the complete paper: https://tomesphere.com/paper/PMC11419354/full.md

---
Source: https://tomesphere.com/paper/PMC11419354