A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training