Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers