Loading paper
Why Adam Works Better with $\beta_1 = \beta_2$: The Missing Gradient Scale Invariance Principle | Tomesphere