Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models