Attention Sinks Induce Gradient Sinks: Massive Activations as Gradient Regulators in Transformers