HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats