TL;DR
This paper investigates how architectural obfuscation techniques affect the interpretability of large language models, showing that while they preserve overall function, they hinder detailed mechanistic understanding.
Contribution
It provides a systematic analysis of obfuscation's impact on interpretability, revealing that it degrades circuit-level insights without affecting model performance.
Findings
Obfuscation alters attention head activation patterns.
Layer-wise computational graphs remain intact.
Fine-grained interpretability is significantly impaired.
Abstract
Architectural obfuscation (e.g., permuting hidden-state tensors, linearly transforming embedding tables, or remapping tokens) has recently gained traction as a lightweight substitute for heavyweight cryptography in privacy-preserving large-language-model (LLM) inference. While recent work has shown that these techniques can be broken under dedicated reconstruction attacks, their impact on mechanistic interpretability has not been systematically studied. In particular, it remains unclear whether scrambling a network's internal representations truly thwarts efforts to understand how the model works, or simply relocates the same circuits to an unfamiliar coordinate system. We address this gap by analyzing a GPT-2-small model trained from scratch with a representative obfuscation map. Assuming the obfuscation map is private and the original basis is hidden (mirroring an honest-but-curious…
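To make the "unfamiliar coordinate system" idea concrete, here is a minimal sketch of one obfuscation family named in the abstract, permuting hidden units. It is a toy two-layer ReLU network in plain Python, not GPT-2 and not the paper's obfuscation map; every name in it (`permute_hidden`, `forward`, the layer sizes) is a hypothetical choice made for illustration. Under those assumptions, it shows that the permuted network computes the same output while each hidden unit moves to a new index.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def forward(W1, W2, x):
    """Toy two-layer network: returns (hidden activations, output)."""
    h = relu(matvec(W1, x))
    return h, matvec(W2, h)


def permute_hidden(W1, W2, perm):
    """Hypothetical obfuscation: new hidden unit i is old unit perm[i].

    Rows of W1 and columns of W2 are reordered consistently, so the
    composed function is unchanged.
    """
    W1p = [W1[j] for j in perm]
    W2p = [[row[j] for j in perm] for row in W2]
    return W1p, W2p


rng = random.Random(0)
d_in, d_hidden, d_out = 3, 5, 2  # illustrative sizes only
W1 = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_hidden)]
W2 = [[rng.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_out)]
x = [rng.uniform(-1, 1) for _ in range(d_in)]

perm = list(range(d_hidden))
rng.shuffle(perm)  # the "private" obfuscation map in this sketch

h, y = forward(W1, W2, x)
W1p, W2p = permute_hidden(W1, W2, perm)
hp, yp = forward(W1p, W2p, x)

# Outputs agree up to floating-point summation order.
same_output = all(abs(a - b) < 1e-9 for a, b in zip(y, yp))
# Hidden activations are the originals, relocated by the permutation.
relocated = hp == [h[j] for j in perm]
print(same_output, relocated)
```

An observer who sees only `W1p`, `W2p`, and `hp`, without `perm`, has the same computation but no correspondence to the original unit indices. Whether that actually blocks circuit-level analysis in a full transformer is the question the paper studies; this sketch does not settle it.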
