Neural Decompiling of Tracr Transformers
Hannes Thurnherr, Kaspar Riesen

TL;DR
This paper introduces a method for interpreting Tracr-compiled transformer weights: it generates a dataset of weight-code pairs and trains a model to recover the original RASP code. The model reproduces over 30% of test programs exactly, and over 70% of its generated programs are functionally equivalent to the original.
Contribution
It presents a first step toward decompiling Tracr-compiled transformer weights into RASP code, using a generated dataset of weight-program pairs and a trained decompiler model, as a contribution to transformer interpretability.
Findings
Exact reproduction on over 30% of test objects.
Over 70% of generated programs are functionally equivalent to the original program.
Test objects that are not reproduced exactly are generally reproduced with only a few errors.
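The difference between the first two findings (exact reproduction versus functional equivalence) can be illustrated with a minimal sketch. This is not the paper's evaluation code: the token lists, the Python callables standing in for RASP programs, and the exhaustive check over a small vocabulary are all assumptions made for illustration, since this summary does not say how equivalence was tested.

```python
from itertools import product
from typing import Callable, List, Sequence

# Hypothetical stand-in: a "program" maps an input sequence to an output list.
Program = Callable[[Sequence[int]], List[int]]


def exact_match(pred_tokens: Sequence[str], true_tokens: Sequence[str]) -> bool:
    """True only if the decompiled token sequence equals the original one."""
    return list(pred_tokens) == list(true_tokens)


def functionally_equivalent(
    pred: Program, truth: Program, vocab: Sequence[int], max_len: int
) -> bool:
    """True if both programs agree on every input up to max_len.

    Assumption: exhaustive enumeration over a small vocabulary; the
    paper's actual equivalence test may differ.
    """
    for length in range(1, max_len + 1):
        for seq in product(vocab, repeat=length):
            if pred(seq) != truth(seq):
                return False
    return True


# Toy example: two different token sequences that compute the same function.
true_tokens = ["x", "+", "x"]
pred_tokens = ["2", "*", "x"]


def truth_program(seq: Sequence[int]) -> List[int]:
    return [x + x for x in seq]


def pred_program(seq: Sequence[int]) -> List[int]:
    return [2 * x for x in seq]


def wrong_program(seq: Sequence[int]) -> List[int]:
    return [x + 1 for x in seq]


is_exact = exact_match(pred_tokens, true_tokens)
is_equivalent = functionally_equivalent(pred_program, truth_program, [0, 1, 2], 3)
is_wrong_equivalent = functionally_equivalent(wrong_program, truth_program, [0, 1, 2], 3)
```

Here the prediction fails the exact-match check but passes the equivalence check, which is why the functional-equivalence rate can be much higher than the exact-reproduction rate.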
Abstract
Recently, the transformer architecture has enabled substantial progress in many areas of pattern recognition and machine learning. However, as with other neural network models, there is currently no general method available to explain their inner workings. The present paper represents a first step in this direction. We utilize Transformer Compiler for RASP (Tracr) to generate a large dataset of pairs of transformer weights and corresponding RASP programs. Based on this dataset, we then build and train a model, with the aim of recovering the RASP code from the compiled model. We demonstrate that the simple form of Tracr-compiled transformer weights is interpretable for such a decompiler model. In an empirical evaluation, our model achieves exact reproductions on more than 30% of the test objects, while the remaining 70% can generally be reproduced with only a few errors.…
Taxonomy
Topics: Neural Networks and Applications
