Suppressing Final Layer Hidden State Jumps in Transformer Pretraining