Sensitivity-Positional Co-Localization in GQA Transformers
Manoj Chandrashekar Rao

TL;DR
This paper investigates the relationship between task-sensitive and positional encoding-sensitive layers in GQA transformers, revealing anti-localization and demonstrating improved performance through targeted interventions.
Contribution
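It introduces a correctness-differential hidden-state metric for identifying task-sensitive layers, together with layer-restricted LoRA and GARFA (per-KV-head RoPE frequency adaptation) for manipulating them, challenging the co-localization hypothesis and improving benchmark performance.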
Findings
Task-sensitive layers are concentrated in the late network layers.
RoPE-influential layers dominate early network layers.
Targeted interventions outperform alternative configurations by 4-16 percentage points.
Abstract
We investigate a fundamental structural question in Grouped Query Attention (GQA) transformers: do the layers most sensitive to task correctness coincide with the layers where positional encoding adaptation has the greatest leverage? We term this the co-localization hypothesis and test it on Llama 3.1 8B, a 32-layer GQA model with a 4:1 query-to-key-value head ratio. We introduce LSLoRA, which restricts LoRA adaptation to layers identified via a novel correctness-differential hidden-state metric, and GARFA (GQA-Aware RoPE Frequency Adaptation), which attaches 8 learnable per-KV-head scalar multipliers to each targeted layer. Contrary to the co-localization hypothesis, we discover strong anti-localization: task-sensitive layers concentrate in the late network while RoPE-influential layers dominate the early network, yielding Spearman…
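The abstract describes GARFA only at a high level, so the following is a minimal sketch of what "8 learnable per-KV-head scalar multipliers" on RoPE frequencies could look like for one layer, not the authors' implementation. The function names, the head dimension of 128, the RoPE base of 500000, the interleaved pairing of dimensions, and the choice to let every query head reuse the multiplier of its KV group are all assumptions made for illustration. The multipliers are plain floats here; in training they would be learnable parameters.

```python
import math

# Assumed shapes for a Llama-3.1-8B-like layer (not stated in full in the source):
NUM_Q_HEADS = 32      # 4:1 ratio with 8 KV heads implies 32 query heads
NUM_KV_HEADS = 8      # one scalar multiplier per KV head
HEAD_DIM = 128        # assumption
ROPE_BASE = 500000.0  # assumption


def base_inv_freq(head_dim, base):
    """Standard RoPE inverse frequencies: base^(-2i/d) for i in [0, d/2)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def kv_head_for_query(q_head, num_q_heads=NUM_Q_HEADS, num_kv_heads=NUM_KV_HEADS):
    """Map a query head to the KV head whose keys/values it shares under GQA."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def rope_angles(position, kv_head, scales, inv_freq):
    """Rotation angles at `position`, with the KV head's scalar applied
    uniformly to every frequency (hypothetical GARFA-style scaling)."""
    return [position * scales[kv_head] * f for f in inv_freq]


def rotate(vec, angles):
    """Rotate consecutive pairs (vec[2i], vec[2i+1]) by angles[i].
    Interleaved pairing is a simplification chosen for readability."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out


if __name__ == "__main__":
    inv_freq = base_inv_freq(HEAD_DIM, ROPE_BASE)
    scales = [1.0] * NUM_KV_HEADS   # all ones reproduces unmodified RoPE
    scales[2] = 0.9                 # pretend training shrank KV head 2's frequencies
    q_head = 9                      # query head 9 belongs to KV group 2
    kv = kv_head_for_query(q_head)
    angles = rope_angles(position=17, kv_head=kv, scales=scales, inv_freq=inv_freq)
    print(kv, angles[0])
```

Sharing one multiplier between a KV head and its four query heads keeps the query-key dot product a function of relative position only, which is one plausible reading of "GQA-aware"; the source does not confirm this detail.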
