Uniform Scaling Limits in AdamW-Trained Transformers | Tomesphere