Entanglement as Memory: Mechanistic Interpretability of Quantum Language Models
Nathan Roll

TL;DR
This study examines the memory strategies that quantum language models learn internally. It finds that models with entangling gates learn a distinct, entanglement-based strategy that is sensitive to noise, whereas classical strategies are more robust.
Contribution
First mechanistic interpretability analysis of quantum language models, demonstrating how entanglement encodes context and how strategies degrade under noise.
Findings
Single-qubit models are classically simulable and mimic classical strategies.
Two-qubit models with entanglement learn a distinct, entanglement-based strategy.
Entanglement strategies degrade on real hardware due to noise.
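To make the noise sensitivity in the last finding concrete, here is a minimal NumPy sketch, not taken from the paper. It prepares a two-qubit Bell state with a Hadamard and a CNOT, applies a global depolarizing channel of strength p, and measures the remaining entanglement with the negativity. The noise model, the choice of negativity as the entanglement measure, and all function names are assumptions made for illustration; the paper's own circuits, metrics, and hardware noise may differ.

```python
import numpy as np

def bell_state_density():
    """Density matrix of (|00> + |11>)/sqrt(2), built as CNOT * (H x I) |00>."""
    ket00 = np.array([1, 0, 0, 0], dtype=complex)
    H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
    I2 = np.eye(2, dtype=complex)
    CNOT = np.array([[1, 0, 0, 0],
                     [0, 1, 0, 0],
                     [0, 0, 0, 1],
                     [0, 0, 1, 0]], dtype=complex)
    psi = CNOT @ np.kron(H, I2) @ ket00
    return np.outer(psi, psi.conj())

def depolarize(rho, p):
    """Assumed noise model: mix rho with the maximally mixed two-qubit state."""
    return (1 - p) * rho + p * np.eye(4, dtype=complex) / 4

def negativity(rho):
    """Sum of |negative eigenvalues| of the partial transpose on the second qubit."""
    # Indices after reshape: (a, b, a', b'); swapping b and b' transposes qubit 2.
    rho_pt = rho.reshape(2, 2, 2, 2).transpose(0, 3, 2, 1).reshape(4, 4)
    eigs = np.linalg.eigvalsh(rho_pt)
    return float(-eigs[eigs < 0].sum())

if __name__ == "__main__":
    rho = bell_state_density()
    for p in (0.0, 0.25, 0.5, 0.75):
        print(f"p={p:.2f}  negativity={negativity(depolarize(rho, p)):.3f}")
```

For this particular state and noise model the negativity works out to max(0, (2 - 3p)/4), so the entanglement vanishes entirely once p reaches 2/3. A strategy that stores context in inter-qubit entanglement has nothing left to read out beyond that point.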
Abstract
Quantum language models have shown competitive performance on sequential tasks, yet whether trained quantum circuits exploit genuinely quantum resources -- or merely embed classical computation in quantum hardware -- remains unknown. Prior work has evaluated these models through endpoint metrics alone, without examining the memory strategies they actually learn internally. We introduce the first mechanistic interpretability study of quantum language models, combining causal gate ablation, entanglement tracking, and density-matrix interchange interventions on a controlled long-range dependency task. We find that single-qubit models are exactly classically simulable and converge to the same geometric strategy as matched classical baselines, while two-qubit models with entangling gates learn a representationally distinct strategy that encodes context in inter-qubit entanglement --…
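The claim that single-qubit models are classically simulable can be illustrated with a standard fact: a single-qubit unitary acts on the Bloch vector as an ordinary 3D rotation. The sketch below is an illustration under assumed gate choices (alternating RY and RZ rotations with hypothetical angles), not the paper's architecture. It evolves a qubit state with 2x2 unitaries and, in parallel, evolves a real 3-vector with 3x3 rotation matrices, then compares the two.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def ry(theta):
    """Single-qubit rotation exp(-i * theta * Y / 2)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def rz(phi):
    """Single-qubit rotation exp(-i * phi * Z / 2)."""
    return np.diag([np.exp(-1j * phi / 2), np.exp(1j * phi / 2)])

def rot_y(theta):
    """Classical 3D rotation about the y-axis by theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(phi):
    """Classical 3D rotation about the z-axis by phi."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def bloch_vector(psi):
    """Expectation values (<X>, <Y>, <Z>) of a pure single-qubit state."""
    return np.array([np.real(np.vdot(psi, P @ psi)) for P in (X, Y, Z)])

def simulate_both(angles):
    """Apply RY(theta) then RZ(phi) per step, in both representations."""
    psi = np.array([1, 0], dtype=complex)   # quantum state |0>
    r = np.array([0.0, 0.0, 1.0])           # its Bloch vector
    for theta, phi in angles:
        psi = rz(phi) @ ry(theta) @ psi
        r = rot_z(phi) @ rot_y(theta) @ r
    return bloch_vector(psi), r

if __name__ == "__main__":
    quantum, classical = simulate_both([(0.3, 1.1), (2.0, -0.7), (0.9, 0.4)])
    print(np.allclose(quantum, classical))
```

The classical side needs only three real numbers and 3x3 matrix products per step, which is why a single-qubit memory offers no resource beyond a small classical recurrent state.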
