Borrowed Geometry: Cross-Distribution Head-Importance Fingerprints of Frozen Pretrained Gemma 4 31B
Abay Bektursun

TL;DR
This paper investigates how frozen pretrained language model weights transfer to non-text tasks, identifying specific attention heads with cross-distribution importance fingerprints and causal roles.
Contribution
It introduces a cross-distribution importance fingerprint and provides causal validation of specific heads' roles in non-language tasks.
Findings
Four attention heads (L26.28, L27.28, L27.2, L27.3) rank top-tier on both the English-text TxtCopy probe and per-head ablation impact on non-language tasks.
Head L26.28 significantly affects performance on OGBench cube-double-play-task1.
Head ablation results demonstrate causal importance of specific heads.
Abstract
Frozen Gemma 4 31B weights, pretrained exclusively on text and left unmodified, transfer through a thin trainable interface to non-text modalities the substrate has never processed. On the L24--L29 slice (192 attention heads), an English-text TxtCopy attention probe (95 sentences) and per-head ablation impact on four non-language token-pattern tasks (binary copy, associative recall, 1D cellular automaton Rule 90, binary addition) jointly classify four heads -- L26.28, L27.28, L27.2, L27.3 -- as top-tier on both signals. The slice-level joint coincidence is significant under a hypergeometric null and survives multiplicity-aware permutation tests. Pretrained Gemma L26 reaches 60.22% on OGBench cube-double-play-task1 vs ~1% for random-init Gemma; a FrozenRandom-GPT2 control with correct scaling also fails.…
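To make the hypergeometric null in the abstract concrete, here is a minimal sketch of an overlap test between two top-tier head sets. Only the slice size (192 heads) and the overlap (four heads) come from the abstract. The top-tier set sizes `TOP_A` and `TOP_B` are hypothetical placeholders, so the printed p-value is illustrative and is not the paper's reported statistic.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """Return P(X >= k) for a hypergeometric draw.

    N: population size, K: marked items in the population,
    n: items drawn without replacement, k: observed marked items in the draw.
    """
    upper = min(K, n)
    # Sum exact integer counts first, then divide once to limit rounding error.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


N_HEADS = 192  # from the abstract: heads in the L24--L29 slice
OVERLAP = 4    # from the abstract: heads top-tier on both signals
TOP_A = 10     # hypothetical: size of the top tier under the TxtCopy probe
TOP_B = 10     # hypothetical: size of the top tier under ablation impact

p_value = hypergeom_sf(OVERLAP, N_HEADS, TOP_A, TOP_B)
print(f"P(overlap >= {OVERLAP}) under the null = {p_value:.2e}")
```

Under this null, the second top tier is treated as a random draw of `TOP_B` heads from the slice, so a small tail probability means the two signals agree more often than chance would explain.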
