BlockCert: Certified Blockwise Extraction of Transformer Mechanisms
Sandro Andric

TL;DR
BlockCert provides a certified, structured method for extracting and verifying transformer mechanisms and local edits, enabling formal guarantees on model behavior and deviations, demonstrated on several language models.
Contribution
Introduces BlockCert, a framework for certified blockwise extraction and local editing of transformers with formal guarantees, bridging interpretability and formal reasoning.
Findings
High per-block coverage and small residual errors across models
Fully stitched TinyLlama model matches baseline perplexity within 6e-5
Framework is feasible for real transformer language models
Abstract
Mechanistic interpretability aspires to reverse-engineer neural networks into explicit algorithms, while model editing seeks to modify specific behaviours without retraining. Both areas are typically evaluated with informal evidence and ad-hoc experiments, with few explicit guarantees about how far an extracted or edited model can drift from the original on relevant inputs. We introduce BlockCert, a framework for certified blockwise extraction of transformer mechanisms, and outline how a lightweight extension can support certified local edits. Given a pre-trained transformer and a prompt distribution, BlockCert extracts structured surrogate implementations for residual blocks together with machine-checkable certificates that bound approximation error, record coverage metrics, and hash the underlying artifacts. We formalize a simple Lipschitz-based composition theorem in Lean 4 that…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsExplainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Topic Modeling
