Suppressing Final Layer Hidden State Jumps in Transformer Pretraining
Keigo Shibata, Kazuki Yano, Ryosuke Takahashi, Jaesung Lee, Wataru Ikeda, Jun Suzuki

TL;DR
This paper investigates the large angular distance jumps in the final layer of Transformer models during pretraining, introduces a regularizer to suppress these jumps, and demonstrates improved task performance with this method.
Contribution
It introduces a novel metric for jump strength, proposes the JREG regularizer to suppress final layer jumps, and shows improved performance in Llama-based models.
Findings
Final-layer jumps are prevalent across many open-weight models and are amplified over the course of pretraining.
JREG effectively reduces jump strength during pretraining.
Models trained with JREG outperform baseline models on the evaluated tasks.
Abstract
This paper discusses the internal behavior of Transformer language models. Many recent pre-trained models have been reported to exhibit only slight changes in the angular distance between the input and output hidden state vectors in the middle Transformer layers, despite a disproportionately large "jump" in the angular distance occurring in or around the final Transformer layer. To characterize this, we first introduce a quantitative metric for the jump strength around the final layer, and then demonstrate its prevalence across many open-weight models, as well as its amplification throughout pre-training. Assuming such jumps indicate an undesirable property, we propose the jump-suppressing regularizer (JREG) which penalizes this jump during pre-training, thereby encouraging more balanced capability usage across the middle layers. Empirical evaluations of three model sizes of…
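To make the quantity in the abstract concrete, the following is a minimal sketch of the angular distance between each layer's input and output hidden states. It uses the common definition arccos(cosine similarity) / π, which is an assumption here: this excerpt does not give the paper's exact jump-strength metric or the JREG loss, and neither is reproduced. The function names and the toy hidden states are hypothetical and purely illustrative.

```python
import numpy as np

def angular_distance(x, y, eps=1e-12):
    """Row-wise angular distance in [0, 1]: arccos(cosine similarity) / pi.

    Assumed definition; the paper's own normalization may differ.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    denom = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1)
    cos = np.sum(x * y, axis=-1) / np.maximum(denom, eps)
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

def layerwise_angular_distances(hidden_states):
    """Mean angular distance between the input and output of each layer.

    hidden_states: list of L + 1 arrays of shape (tokens, dim), where
    entry l is the input to layer l and entry l + 1 is its output.
    """
    return np.array([
        angular_distance(h_in, h_out).mean()
        for h_in, h_out in zip(hidden_states[:-1], hidden_states[1:])
    ])

# Toy data (not from the paper): five layers make small updates,
# then a final layer maps to an unrelated direction.
rng = np.random.default_rng(0)
h = [rng.normal(size=(4, 16))]
for _ in range(5):
    h.append(h[-1] + 0.05 * rng.normal(size=(4, 16)))
h.append(rng.normal(size=(4, 16)))

d = layerwise_angular_distances(h)
print(np.round(d, 3))  # the last entry is much larger than the others
```

In this toy setup the final entry dominates the per-layer profile, which is the qualitative pattern the abstract calls a jump. With a real model, the per-layer hidden states would be substituted for the synthetic list `h`.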
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks
