Mechanistic Interpretability in the Presence of Architectural Obfuscation

Marcos Florencio; Thomas Barton

arXiv:2506.18053·cs.CR·June 24, 2025


TL;DR

This paper investigates how architectural obfuscation techniques affect the interpretability of large language models, showing that while they preserve overall function, they hinder detailed mechanistic understanding.

Contribution

The paper provides a systematic analysis of how obfuscation affects interpretability, finding that it degrades circuit-level insight without affecting model performance.

Findings

01 Obfuscation alters attention head activation patterns.

02 Layer-wise computational graphs remain intact.

03 Fine-grained interpretability is significantly impaired.

Abstract

Architectural obfuscation - e.g., permuting hidden-state tensors, linearly transforming embedding tables, or remapping tokens - has recently gained traction as a lightweight substitute for heavyweight cryptography in privacy-preserving large-language-model (LLM) inference. While recent work has shown that these techniques can be broken under dedicated reconstruction attacks, their impact on mechanistic interpretability has not been systematically studied. In particular, it remains unclear whether scrambling a network's internal representations truly thwarts efforts to understand how the model works, or simply relocates the same circuits to an unfamiliar coordinate system. We address this gap by analyzing a GPT-2-small model trained from scratch with a representative obfuscation map. Assuming the obfuscation map is private and the original basis is hidden (mirroring an honest-but-curious…
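The question of whether scrambling "simply relocates the same circuits to an unfamiliar coordinate system" can be illustrated with a toy case. The sketch below is not the paper's obfuscation map or model: it assumes a hypothetical two-layer ReLU MLP with random weights and applies a secret permutation to its hidden units (rows of the first weight matrix, columns of the second). The names `mlp` and `permute_hidden` are made up for this illustration. Outputs are unchanged, while an observer who does not know the permutation sees the hidden activations in shuffled order.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def mlp(W1, W2, x):
    """Toy two-layer MLP; returns (hidden activations, output)."""
    h = relu(matvec(W1, x))
    return h, matvec(W2, h)

def permute_hidden(W1, W2, perm):
    """Relabel hidden units: new unit i is old unit perm[i].

    Rows of W1 and columns of W2 are permuted consistently, so the
    composed function is unchanged (ReLU acts elementwise).
    """
    W1p = [W1[j] for j in perm]
    W2p = [[row[j] for j in perm] for row in W2]
    return W1p, W2p

# Hypothetical sizes and random weights, chosen only for illustration.
rng = random.Random(0)
d_in, d_hid, d_out = 4, 6, 3
W1 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hid)]
W2 = [[rng.gauss(0, 1) for _ in range(d_hid)] for _ in range(d_out)]
x = [rng.gauss(0, 1) for _ in range(d_in)]

perm = list(range(d_hid))
rng.shuffle(perm)  # the "private" obfuscation map in this toy setting

h, y = mlp(W1, W2, x)
W1p, W2p = permute_hidden(W1, W2, perm)
hp, yp = mlp(W1p, W2p, x)

# Function is preserved up to floating-point summation order.
same_output = all(abs(a - b) < 1e-9 for a, b in zip(y, yp))
# Hidden activations are the originals in permuted coordinates.
relocated = hp == [h[j] for j in perm]
print(same_output, relocated)
```

In this toy setting the "circuit" is fully recoverable once the permutation is known; the listing above summarizes the paper's findings for the harder case where the map is private and applied inside a trained transformer.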

Code & Models

Repositories

themarcosf/mech-interp-paper
jax·Official
