Tracr-Injection: Distilling Algorithms into Pre-trained Language Models
Tomás Vergara-Browne, Álvaro Soto

TL;DR
This paper introduces tracr-injection, a novel method to embed RASP algorithms into pre-trained language models, enhancing interpretability and out-of-distribution performance by creating a symbolic subspace within the model.
Contribution
The paper presents tracr-injection, a new technique for distilling RASP algorithms into language models, bridging the gap between theoretical symbolic capabilities and practical learnability.
Findings
Injected algorithms form an interpretable subspace in the residual stream
The method improves out-of-distribution performance
The injection creates a symbolic mechanism within the model
Abstract
Motivated by the surge of large language models, there has been a push to formally characterize the symbolic abilities intrinsic to the transformer architecture. A programming language, called RASP, has been proposed, which can be directly compiled into transformer weights to implement these algorithms. However, the tasks that can be implemented in RASP are often uncommon to learn from natural unsupervised data, showing a mismatch between theoretical capabilities of the transformer architecture, and the practical learnability of these capabilities from unsupervised data. We propose tracr-injection, a method that allows us to distill algorithms written in RASP directly into a pre-trained language model. We showcase our method by injecting 3 different algorithms into a language model. We show how our method creates an interpretable subspace within the model's residual stream, which can be…
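For readers unfamiliar with RASP, the following is a minimal sketch of its core primitives (select, aggregate, selector width) emulated in plain Python. It is illustrative only: the function names and signatures here are hypothetical, it is not the tracr library's API, and the two small programs are generic examples rather than the algorithms the paper injects.

```python
from typing import Callable, List, Sequence

# A selector is a boolean "attention pattern": one row per query position,
# one column per key position.
Selector = List[List[bool]]


def select(keys: Sequence, queries: Sequence, predicate: Callable) -> Selector:
    """Entry [q][k] is True when predicate(key, query) holds."""
    return [[predicate(k, q) for k in keys] for q in queries]


def aggregate(selector: Selector, values: Sequence[float]) -> List[float]:
    """For each query position, average the values at the selected keys."""
    out = []
    for row in selector:
        chosen = [v for v, keep in zip(values, row) if keep]
        out.append(sum(chosen) / len(chosen) if chosen else 0.0)
    return out


def selector_width(selector: Selector) -> List[int]:
    """For each query position, count how many keys are selected."""
    return [sum(row) for row in selector]


def token_counts(tokens: Sequence) -> List[int]:
    """Hypothetical example program: how often each position's token occurs."""
    same_token = select(tokens, tokens, lambda k, q: k == q)
    return selector_width(same_token)


def frac_equal(tokens: Sequence, target) -> List[float]:
    """Hypothetical example program: fraction of tokens equal to `target`."""
    everyone = select(tokens, tokens, lambda k, q: True)
    is_target = [1.0 if t == target else 0.0 for t in tokens]
    return aggregate(everyone, is_target)


if __name__ == "__main__":
    print(token_counts(list("hello")))
    print(frac_equal(list("abca"), "a"))
```

In this sketch, `select` plays the role of an attention pattern and `aggregate` the role of the attention-weighted average, which is the correspondence that lets RASP programs be compiled into transformer weights.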
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
