Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning
Md Muntaqim Meherab, Noor Islam S. Mohammad, and Faiza Feroz

TL;DR
This paper introduces Causal Concept Graphs (CCG), a method to model and analyze causal relationships between concepts in language models during multi-step reasoning, improving interpretability and intervention effectiveness.
Contribution
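It combines task-conditioned sparse autoencoders for concept discovery with DAGMA-style differentiable structure learning to recover sparse, domain-specific causal graphs. Interventions guided by these graphs outperform baselines on reasoning benchmarks, as measured by the Causal Fidelity Score.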
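For readers unfamiliar with the concept-discovery step, the sketch below shows a generic ReLU sparse autoencoder of the kind used to extract interpretable latents from residual-stream activations. It covers only the forward pass and the usual objective (reconstruction error plus an L1 sparsity penalty). This summary does not describe how the paper implements task conditioning, so that part is not modeled. The class name, `l1_coeff`, and the latent width are illustrative assumptions. The activations are random stand-ins sized to GPT-2 Medium's hidden width of 1024.

```python
import numpy as np


class SparseAutoencoder:
    """Generic ReLU sparse autoencoder (forward pass and objective only).

    Illustrative sketch: the paper's task conditioning is NOT modeled here.
    """

    def __init__(self, d_model: int, d_latent: int, l1_coeff: float = 1e-3, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Overcomplete dictionary: d_latent is usually several times d_model.
        self.W_enc = rng.normal(0.0, d_model ** -0.5, size=(d_model, d_latent))
        self.b_enc = np.zeros(d_latent)
        self.W_dec = rng.normal(0.0, d_latent ** -0.5, size=(d_latent, d_model))
        self.b_dec = np.zeros(d_model)
        self.l1_coeff = l1_coeff

    def encode(self, x: np.ndarray) -> np.ndarray:
        # Center by the decoder bias, project, then ReLU so codes are non-negative.
        return np.maximum(0.0, (x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, z: np.ndarray) -> np.ndarray:
        # Each active latent adds its decoder row (a candidate concept direction).
        return z @ self.W_dec + self.b_dec

    def loss(self, x: np.ndarray) -> tuple[float, np.ndarray]:
        z = self.encode(x)
        x_hat = self.decode(z)
        recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))              # squared reconstruction error
        sparsity = self.l1_coeff * np.mean(np.sum(np.abs(z), axis=-1))  # L1 term pushes latents toward 0
        return float(recon + sparsity), z


# Random stand-ins for GPT-2 Medium residual-stream activations (d_model = 1024).
rng = np.random.default_rng(1)
acts = rng.normal(size=(8, 1024))
sae = SparseAutoencoder(d_model=1024, d_latent=4096)
loss, codes = sae.loss(acts)
```

Training would minimize `loss` with an autodiff framework. NumPy is used here only to keep the objective readable.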
Findings
CCG achieves higher Causal Fidelity Scores (CFS) than ROME-style tracing, SAE-only ranking, and a random baseline.
Learned graphs are sparse, domain-specific, and stable across seeds.
Interventions guided by CCG induce larger downstream effects than random interventions.
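The findings state that learned graphs are sparse and stable across seeds, but this summary does not say how either property is measured. One simple approach, assumed here and not taken from the paper, is to threshold each seed's weighted adjacency matrix into a set of directed edges, report edge density, and average the pairwise Jaccard overlap of edge sets across seeds. The threshold `tau`, the function names, and the example matrices are all hypothetical.

```python
from itertools import combinations


def edge_set(W, tau=0.3):
    """Hypothetical helper: keep directed edges i -> j with |W[i][j]| > tau (no self-loops)."""
    return {(i, j) for i, row in enumerate(W) for j, w in enumerate(row)
            if i != j and abs(w) > tau}


def density(edges, d):
    """Fraction of the d*(d-1) possible directed edges that are present."""
    return len(edges) / (d * (d - 1))


def mean_pairwise_jaccard(edge_sets):
    """Average Jaccard overlap over all pairs of seeds (1.0 = identical graphs)."""
    if len(edge_sets) < 2:
        raise ValueError("need at least two seeds")
    scores = []
    for a, b in combinations(edge_sets, 2):
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Three toy 3-concept graphs standing in for three seeds.
W1 = [[0, 0.8, 0], [0, 0, 0.5], [0, 0, 0]]
W2 = [[0, 0.7, 0], [0, 0, 0.4], [0, 0, 0]]
W3 = [[0, 0.9, 0], [0, 0, 0.1], [0.6, 0, 0]]
sets = [edge_set(W) for W in (W1, W2, W3)]
stability = mean_pairwise_jaccard(sets)
```

Here seeds 1 and 2 agree exactly and seed 3 shares one of three edges with each, so the mean overlap is 5/9.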
Abstract
Sparse autoencoders can localize where concepts live in language models, but not how they interact during multi-step reasoning. We propose Causal Concept Graphs (CCG): a directed acyclic graph over sparse, interpretable latent features, where edges capture learned causal dependencies between concepts. We combine task-conditioned sparse autoencoders for concept discovery with DAGMA-style differentiable structure learning for graph recovery and introduce the Causal Fidelity Score (CFS) to evaluate whether graph-guided interventions induce larger downstream effects than random ones. On ARC-Challenge, StrategyQA, and LogiQA with GPT-2 Medium, across five seeds ( paired runs), CCG achieves , outperforming ROME-style tracing (), SAE-only ranking (), and a random baseline (), with after Bonferroni correction.…
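DAGMA (Bello et al., 2022) makes acyclicity differentiable through the log-determinant function h^s(W) = -log det(sI - W∘W) + d log s. This function is zero exactly when the weighted adjacency matrix W has no directed cycles, provided the spectral radius of W∘W stays below s. The abstract describes CCG's graph recovery only as "DAGMA-style", so the sketch below shows just this standard acyclicity term and its gradient. It assumes the term is applied to a weighted graph over SAE latents. The function names are hypothetical, and the full optimizer (data-fit loss, sparsity penalty, central-path schedule) is omitted.

```python
import numpy as np


def dagma_h(W: np.ndarray, s: float = 1.0) -> float:
    """Log-determinant acyclicity h^s(W) = -log det(sI - W*W) + d*log(s).

    Zero iff W is acyclic; positive for cyclic W inside the valid domain.
    """
    d = W.shape[0]
    A = W * W  # elementwise square: non-negative edge strengths
    # Valid domain: spectral radius of W*W must be below s (sI - A is then an M-matrix).
    if np.max(np.abs(np.linalg.eigvals(A))) >= s:
        raise ValueError("spectral radius of W*W must be < s")
    _, logdet = np.linalg.slogdet(s * np.eye(d) - A)
    return float(-logdet + d * np.log(s))


def dagma_h_grad(W: np.ndarray, s: float = 1.0) -> np.ndarray:
    """Gradient of h^s with respect to W: 2 * (sI - W*W)^{-T} * W (elementwise)."""
    d = W.shape[0]
    M_inv = np.linalg.inv(s * np.eye(d) - W * W)
    return 2.0 * M_inv.T * W


# A 3-node chain 0 -> 1 -> 2 (acyclic) and the same chain closed into a cycle.
dag = np.array([[0.0, 0.8, 0.0],
                [0.0, 0.0, -0.5],
                [0.0, 0.0, 0.0]])
cyc = dag.copy()
cyc[2, 0] = 0.6  # adds edge 2 -> 0, creating the cycle 0 -> 1 -> 2 -> 0
```

For the chain, sI - W∘W is upper triangular with diagonal s, so its determinant is s^d and h vanishes. Closing the cycle lowers the determinant to 1 - 0.64·0.25·0.36 at s = 1, which makes h positive. In a DAGMA-style fit, minimizing a data loss while driving h to zero steers the learned concept graph toward a DAG.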
