A Multi-Level Causal Intervention Framework for Mechanistic Interpretability in Variational Autoencoders
Dip Roy, Rajiv Misra, Sanjay Kumar Singh, Anisha Roy

TL;DR
This paper introduces a multilevel causal intervention framework for understanding VAEs, proposes three new metrics (Causal Effect Strength, intervention specificity, and circuit modularity), and reports an empirical analysis across six architectures and multiple datasets.
Contribution
It presents what the authors describe as the first general-purpose causal intervention framework for VAEs, along with three new quantitative metrics and a large empirical study of VAE causal mechanisms.
Findings
Causal Effect Strength (CES) correlates negatively with DCI disentanglement within datasets.
KL reweighting in beta-VAE causes capacity bottlenecks on complex datasets.
No single VAE architecture outperforms others across all datasets.
Abstract
Understanding how generative models represent and transform data is a foundational problem in deep learning interpretability. While mechanistic interpretability of discriminative architectures has yielded substantial insights, relatively little work has addressed variational autoencoders (VAEs). This paper presents the first general-purpose multilevel causal intervention framework for mechanistic interpretability of VAEs. The framework comprises four manipulation types: input manipulation, latent-space perturbation, activation patching, and causal mediation analysis. We also define three new quantitative metrics capturing properties not measured by existing disentanglement metrics alone: Causal Effect Strength (CES), intervention specificity, and circuit modularity. We conduct the largest empirical study to date of VAE causal mechanisms across six architectures (standard VAE, beta-VAE,…
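As a rough illustration of one of the four manipulation types named in the abstract, latent-space perturbation, the sketch below shifts a single latent coordinate and measures how much the decoder output moves. Everything here is an assumption made for illustration: the toy tanh decoder, the function names make_toy_decoder and latent_perturbation_effect, and the mean-absolute-change score are hypothetical and are not the paper's models or its definition of CES.

```python
import math
import random


def make_toy_decoder(latent_dim, output_dim, seed=0):
    """Build a hypothetical one-layer tanh decoder with fixed random weights."""
    rng = random.Random(seed)
    weights = [
        [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        for _ in range(output_dim)
    ]

    def decode(z):
        # Each output unit is tanh of a weighted sum of the latent coordinates.
        return [
            math.tanh(sum(w * zj for w, zj in zip(row, z)))
            for row in weights
        ]

    return decode


def latent_perturbation_effect(decode, z, dim, delta):
    """Mean absolute change in the output after shifting z[dim] by delta.

    This is an illustrative effect score, not the paper's CES metric.
    """
    z_perturbed = list(z)
    z_perturbed[dim] += delta
    baseline = decode(z)
    intervened = decode(z_perturbed)
    return sum(abs(a - b) for a, b in zip(baseline, intervened)) / len(baseline)


decode = make_toy_decoder(latent_dim=4, output_dim=8)
z = [0.0] * 4

# One effect score per latent dimension, each from a unit shift.
effects = [latent_perturbation_effect(decode, z, d, 1.0) for d in range(4)]
for d, effect in enumerate(effects):
    print(f"latent dim {d}: mean |output change| = {effect:.3f}")
```

With a trained VAE, decode would be the model's decoder and z would come from the encoder. Comparing the per-dimension scores then gives a first, coarse view of which latent coordinates the output is most sensitive to.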
