nnterp: A Standardized Interface for Mechanistic Interpretability of Transformers
Clément Dumas

TL;DR
nnterp offers a standardized, reliable interface for transformer interpretability that works across diverse architectures, combining the accuracy of HuggingFace with the consistency of custom tools.
Contribution
The paper introduces nnterp, a lightweight wrapper around NNsight that standardizes transformer analysis, enabling cross-architecture interpretability work backed by validation tests and built-in methods.
Findings
Supports 50+ models across 16 architectures
Ensures consistent analysis with validation tests
Includes common interpretability tools
Abstract
Mechanistic interpretability research requires reliable tools for analyzing transformer internals across diverse architectures. Current approaches face a fundamental tradeoff: custom implementations like TransformerLens ensure consistent interfaces but require coding a manual adaptation for each architecture, introducing numerical mismatch with the original models, while direct HuggingFace access through NNsight preserves exact behavior but lacks standardization across models. To bridge this gap, we develop nnterp, a lightweight wrapper around NNsight that provides a unified interface for transformer analysis while preserving original HuggingFace implementations. Through automatic module renaming and comprehensive validation testing, nnterp enables researchers to write intervention code once and deploy it across 50+ model variants spanning 16 architecture families. The library includes…
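The "automatic module renaming" idea mentioned in the abstract can be illustrated with a minimal sketch. This is not nnterp's code or API: the alias table, the `resolve` helper, and the `UnifiedView` class are hypothetical names invented for illustration, and the toy objects only mimic the attribute layout of GPT-2-style (`transformer.h`, `transformer.ln_f`) and Llama-style (`model.layers`, `model.norm`) HuggingFace models.

```python
from types import SimpleNamespace

# Hypothetical alias table: unified name -> candidate attribute paths used by
# different architecture families (illustrative only, not nnterp's actual table).
ALIASES = {
    "layers": ("transformer.h", "model.layers"),
    "final_norm": ("transformer.ln_f", "model.norm"),
}


def resolve(root, dotted_path):
    """Follow a dotted attribute path; return None if any segment is missing."""
    obj = root
    for name in dotted_path.split("."):
        obj = getattr(obj, name, None)
        if obj is None:
            return None
    return obj


class UnifiedView:
    """Expose architecture-specific submodules under one set of names."""

    def __init__(self, model):
        self._model = model

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so `_model` is safe.
        for path in ALIASES.get(name, ()):
            found = resolve(self._model, path)
            if found is not None:
                return found
        raise AttributeError(f"no known alias for {name!r} on this model")


# Toy stand-ins for two model layouts (strings replace real modules).
gpt2_like = SimpleNamespace(
    transformer=SimpleNamespace(h=["block0", "block1"], ln_f="ln_f")
)
llama_like = SimpleNamespace(
    model=SimpleNamespace(layers=["layer0", "layer1", "layer2"], norm="rms_norm")
)

for toy in (gpt2_like, llama_like):
    view = UnifiedView(toy)
    # The same two attribute names work for both layouts.
    print(len(view.layers), view.final_norm)
```

Under these assumptions, code written against `view.layers` and `view.final_norm` runs unchanged on either layout, which is the property the abstract describes as writing intervention code once and deploying it across architectures.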
Taxonomy
Topics: Power Transformer Diagnostics and Insulation · Explainable Artificial Intelligence (XAI) · Magnetic Properties and Applications
