Explaining the Explainer: Understanding the Inner Workings of Transformer-based Symbolic Regression Models
Arco van Breda, Erman Acar

TL;DR
This paper introduces PATCHES, an evolutionary algorithm for discovering circuits in transformer-based symbolic regression models, providing the first circuit-level understanding and validating the causal relevance of identified circuits.
Contribution
The paper presents PATCHES, a novel mechanistic interpretability method for symbolic regression (SR) transformers, and demonstrates that it identifies functionally correct circuits.
Findings
PATCHES successfully isolates 28 circuits in SR transformers
Mean patching with performance evaluation best identifies correct circuits
Logit attribution and probing mainly capture correlational rather than causal features
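To make the mean-patching finding concrete, here is a minimal sketch on a toy model; it is not the PATCHES implementation. It assumes a model whose output is a linear readout over the summed activations of a few independent components. Components outside a candidate circuit have their activations replaced by the dataset mean, and the circuit is scored by how much of the full model's accuracy it retains. All names (`component_activations`, `run_with_mean_patching`, `circuit_mask`), the toy architecture, and this particular faithfulness ratio are illustrative assumptions; the paper's own definitions may differ.

```python
import numpy as np

def component_activations(x, weights):
    # Hypothetical toy "components": x is (n_samples, d_in),
    # weights is (n_components, d_in, d_hidden).
    # Returns activations of shape (n_samples, n_components, d_hidden).
    return np.tanh(np.einsum("nd,cdh->nch", x, weights))

def run_with_mean_patching(x, weights, readout, circuit_mask, mean_acts):
    # Keep activations of components inside the circuit; replace all
    # others with their mean activation over a reference dataset.
    acts = component_activations(x, weights)
    keep = circuit_mask[None, :, None]
    patched = np.where(keep, acts, mean_acts[None, :, :])
    return patched.sum(axis=1) @ readout

def accuracy(logits, labels):
    return float((logits.argmax(axis=1) == labels).mean())

rng = np.random.default_rng(0)
n_components, d_in, d_hidden, n_classes = 6, 4, 8, 3
weights = rng.normal(size=(n_components, d_in, d_hidden))
readout = rng.normal(size=(d_hidden, n_classes))
x = rng.normal(size=(200, d_in))

# Mean activation per component, computed over the reference data.
mean_acts = component_activations(x, weights).mean(axis=0)

# Full model: every component kept, so nothing is patched.
full_mask = np.ones(n_components, dtype=bool)
full_logits = run_with_mean_patching(x, weights, readout, full_mask, mean_acts)

# Assumption: the full model's own predictions serve as the targets.
labels = full_logits.argmax(axis=1)

# Candidate circuit: an arbitrary subset of components, chosen for illustration.
circuit_mask = np.array([True, True, False, True, False, False])
circuit_logits = run_with_mean_patching(x, weights, readout, circuit_mask, mean_acts)

# Performance-based faithfulness: circuit accuracy relative to the full model.
faithfulness = accuracy(circuit_logits, labels) / accuracy(full_logits, labels)
print(f"faithfulness of candidate circuit: {faithfulness:.2f}")
```

In a setup like this, a search procedure such as an evolutionary algorithm would propose different values of `circuit_mask` and prefer small masks that keep the score high. Mean patching is often preferred over zeroing activations because it keeps the patched components closer to the activation range the model saw during training.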
Abstract
Following their success across many domains, transformers have also proven effective for symbolic regression (SR); however, the internal mechanisms underlying their generation of mathematical operators remain largely unexplored. Although mechanistic interpretability has successfully identified circuits in language and vision models, it has not yet been applied to SR. In this article, we introduce PATCHES, an evolutionary circuit discovery algorithm that identifies compact and correct circuits for SR. Using PATCHES, we isolate 28 circuits, providing the first circuit-level characterisation of an SR transformer. We validate these findings through a robust causal evaluation framework based on key notions such as faithfulness, completeness, and minimality. Our analysis shows that mean patching with performance-based evaluation most reliably isolates functionally correct circuits. In…
Taxonomy
Topics: Evolutionary Algorithms and Applications · Explainable Artificial Intelligence (XAI) · Machine Learning in Materials Science
