Toward Understanding Why Adam Converges Faster Than SGD for Transformers