Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates