BlockCert: Certified Blockwise Extraction of Transformer Mechanisms

Sandro Andric

arXiv:2511.17645·cs.LG·November 26, 2025

BlockCert: Certified Blockwise Extraction of Transformer Mechanisms

Sandro Andric

PDF

Open Access

TL;DR

BlockCert provides a certified, structured method for extracting and verifying transformer mechanisms and local edits, enabling formal guarantees on model behavior and deviations, demonstrated on several language models.

Contribution

Introduces BlockCert, a framework for certified blockwise extraction and local editing of transformers with formal guarantees, bridging interpretability and formal reasoning.

Findings

01

High per-block coverage and small residual errors across models

02

Fully stitched TinyLlama model matches baseline perplexity within 6e-5

03

Framework is feasible for real transformer language models

Abstract

Mechanistic interpretability aspires to reverse-engineer neural networks into explicit algorithms, while model editing seeks to modify specific behaviours without retraining. Both areas are typically evaluated with informal evidence and ad-hoc experiments, with few explicit guarantees about how far an extracted or edited model can drift from the original on relevant inputs. We introduce BlockCert, a framework for certified blockwise extraction of transformer mechanisms, and outline how a lightweight extension can support certified local edits. Given a pre-trained transformer and a prompt distribution, BlockCert extracts structured surrogate implementations for residual blocks together with machine-checkable certificates that bound approximation error, record coverage metrics, and hash the underlying artifacts. We formalize a simple Lipschitz-based composition theorem in Lean 4 that…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsExplainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Topic Modeling