Surgical Repair of Collapsed Attention Heads in ALiBi Transformers
Palmer Schallon

TL;DR
This paper identifies a systematic attention collapse in ALiBi transformers and introduces surgical reinitialization, a targeted method that restores collapsed attention heads and improves model performance.
Contribution
The paper presents a novel surgical reinitialization technique to fix collapsed attention heads in ALiBi transformers, improving model capacity and performance.
Findings
Two passes of surgical reinitialization raise the number of operational attention heads in BLOOM-1b7 from 242 to 379 of 384 (98.7%).
Reinitializing healthy heads alongside collapsed ones improves perplexity by 25%.
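Attention collapse follows a predictable pattern across four model scales (560M to 7.1B parameters), concentrating in the head indices where ALiBi's slope schedule is steepest.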
Abstract
We identify a systematic attention collapse pathology in the BLOOM family of transformer language models, where ALiBi positional encoding causes 31-44% of attention heads to attend almost entirely to the beginning-of-sequence token. The collapse follows a predictable pattern across four model scales (560M to 7.1B parameters), concentrating in head indices where ALiBi's slope schedule imposes the steepest distance penalties. We introduce surgical reinitialization: targeted Q/K/V reinitialization with zeroed output projections and gradient-masked freezing of all non-surgical parameters. Applied to BLOOM-1b7 on a single consumer GPU, the technique recovers 98.7% operational head capacity (242 to 379 of 384 heads) in two passes. A controlled comparison with C4 training data confirms that reinitialization -- not corpus content -- drives recovery, and reveals two distinct post-surgical…
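The abstract ties the collapse to the head indices with the steepest ALiBi slopes. As a rough illustration of which indices those are, the sketch below computes the standard geometric ALiBi slope schedule for a power-of-two head count. The figure of 16 heads per layer is an assumption made for the example, and the helper names `alibi_slopes` and `alibi_bias` are hypothetical; none of this is taken from the paper's code.

```python
import math

def alibi_slopes(n_heads: int) -> list[float]:
    """Geometric ALiBi slope schedule, valid for power-of-two head counts.

    Head h (0-indexed) gets slope 2 ** (-8 * (h + 1) / n_heads).
    """
    if n_heads <= 0 or n_heads & (n_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]

def alibi_bias(slope: float, query_pos: int) -> list[float]:
    """Additive bias for one causal query: -slope * distance to each key."""
    return [-slope * (query_pos - key_pos) for key_pos in range(query_pos + 1)]

# Assumption for illustration: 16 heads per layer.
slopes = alibi_slopes(16)

# The steepest distance penalty belongs to the largest slope.
steepest = max(range(len(slopes)), key=lambda h: slopes[h])

for h, m in enumerate(slopes):
    print(f"head {h:2d}: slope {m:.6f}")
print("steepest head index:", steepest)
```

Under this schedule the slopes shrink geometrically with head index, so the lowest indices carry the largest per-token distance penalty. Whether those are exactly the heads that collapse in a given checkpoint is an empirical question that the paper addresses; the sketch only shows the schedule.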
Taxonomy
Topics: EEG and Brain-Computer Interfaces · Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices
