One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs