ScholarGym: Benchmarking Large Language Model Capabilities in the Information-Gathering Stage of Deep Research