Concept-Based Mechanistic Interpretability Using Structured Knowledge Graphs
Sofiia Chorna, Kateryna Tarelkina, Eloïse Berthier, Gianni Franchi

TL;DR
This paper introduces BAGEL, a knowledge graph-based framework for global mechanistic interpretability of neural networks, revealing how semantic concepts emerge and interact within models to improve understanding and trust.
Contribution
It presents a novel, scalable framework and visualization tool that extends concept-based interpretability into mechanistic insights using structured knowledge graphs.
Findings
Reveals how high-level concepts propagate through model layers
Identifies latent circuits and interactions underlying decisions
Helps detect spurious correlations and biases
Abstract
While concept-based interpretability methods have traditionally focused on local explanations of neural network predictions, we propose a novel framework and interactive tool that extends these methods into the domain of mechanistic interpretability. Our approach enables a global dissection of model behavior by analyzing how high-level semantic attributes (referred to as concepts) emerge, interact, and propagate through internal model components. Unlike prior work that isolates individual neurons or predictions, our framework systematically quantifies how semantic concepts are represented across layers, revealing latent circuits and information flow that underlie model decision-making. A key innovation is our visualization platform that we named BAGEL (for Bias Analysis with a Graph for global Explanation Layers), which presents these insights in a structured knowledge graph, allowing…
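As a rough illustration of what "quantifying how semantic concepts are represented across layers" and presenting the result as a graph could look like, the sketch below scores one concept against each layer with a difference-of-means linear probe and keeps the strong concept-to-layer links as graph edges. The abstract does not specify how BAGEL computes these scores, so the probe, the `min_score` threshold, the toy activations, and all function names here are assumptions made for illustration only, not the authors' method.

```python
from statistics import fmean

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [fmean(col) for col in zip(*rows)]

def probe_accuracy(activations, labels):
    """Fit a difference-of-means linear probe; return its training accuracy."""
    pos = [a for a, y in zip(activations, labels) if y == 1]
    neg = [a for a, y in zip(activations, labels) if y == 0]
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    direction = [p - n for p, n in zip(mu_pos, mu_neg)]
    midpoint = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    # Decision boundary sits halfway between the two class means.
    threshold = sum(d * m for d, m in zip(direction, midpoint))
    correct = 0
    for a, y in zip(activations, labels):
        score = sum(d * x for d, x in zip(direction, a))
        correct += int((score > threshold) == (y == 1))
    return correct / len(labels)

def build_concept_graph(layer_activations, concept_labels, min_score=0.75):
    """Return (concept, layer, score) edges for probes scoring >= min_score."""
    edges = []
    for concept, labels in concept_labels.items():
        for layer, acts in layer_activations.items():
            score = probe_accuracy(acts, labels)
            if score >= min_score:
                edges.append((concept, layer, score))
    return edges

# Hypothetical toy data: four samples, two activation dimensions per layer.
layer_activations = {
    # "early": the concept is not linearly readable (class means coincide).
    "early": [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]],
    # "late": the first dimension separates the two classes.
    "late": [[2.0, 0.1], [1.8, -0.1], [-2.0, 0.1], [-1.9, -0.1]],
}
concept_labels = {"striped": [1, 1, 0, 0]}  # 1 = concept present

edges = build_concept_graph(layer_activations, concept_labels)
for concept, layer, score in edges:
    print(f"{concept} -> {layer} (probe accuracy {score:.2f})")
```

In this toy setup only the later layer yields an edge, which mirrors the idea of a concept emerging at a particular depth. A real analysis would use held-out data for the probe score and activations extracted from an actual model.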
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks · Adversarial Robustness in Machine Learning
