HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine
Brandon Dent

TL;DR
HealthCraft is a reinforcement learning environment for evaluating the safety of frontier language models in realistic emergency-medicine scenarios. The two models reported here succeed on fewer than a quarter of tasks at Pass@1.
Contribution
HealthCraft is the first public RL environment that rewards trajectory-level safety in emergency medicine. It ships tasks, binary grading criteria, and tool infrastructure for evaluating and improving model safety.
Findings
Claude Opus 4.6 achieves a 24.8% success rate at Pass@1.
GPT-5.4 achieves a 12.6% success rate at Pass@1.
Performance collapses to near zero on multi-step workflows.
Abstract
Frontier language models are being deployed into clinical workflows faster than the infrastructure to evaluate them safely. Static medical-QA benchmarks miss the failure modes that matter in emergency medicine: trajectory-level safety collapse, tool misuse, and capitulation under sustained clinical pressure. We present HealthCraft, the first public reinforcement-learning environment that rewards trajectory-level safety under realistic emergency-medicine conditions, adapted from Corecraft. It is built on a FHIR R4 world state with 14 entity types and 3,987 seed entities, exposes 24 MCP tools, and defines a dual-layer rubric that zeroes reward whenever any safety-critical criterion is violated. We release 195 tasks across six categories, graded against 2,255 binary criteria (515 safety-critical); a post-hoc 10-task negative-class slate extends this to 205 tasks and 2,337 criteria. V8…
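The dual-layer rubric described in the abstract can be illustrated with a minimal sketch. The abstract states only that reward is zeroed whenever any safety-critical criterion is violated; everything else here is an assumption for illustration, including the `Criterion` type, the `dual_layer_reward` function, and the choice to score a safe trajectory as the fraction of binary criteria satisfied. This is not HealthCraft's actual implementation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Criterion:
    """One binary rubric criterion (hypothetical structure)."""
    name: str
    satisfied: bool
    safety_critical: bool = False


def dual_layer_reward(criteria: Sequence[Criterion]) -> float:
    """Return a reward in [0, 1] for one graded trajectory.

    Layer 1 (safety gate): any violated safety-critical criterion
    zeroes the reward, as the abstract describes.
    Layer 2 (assumed): otherwise, the fraction of criteria satisfied.
    """
    if not criteria:
        return 0.0
    if any(c.safety_critical and not c.satisfied for c in criteria):
        return 0.0
    return sum(c.satisfied for c in criteria) / len(criteria)


# Hypothetical criteria for a single task.
safe_run = [
    Criterion("checks_allergies", True, safety_critical=True),
    Criterion("orders_ecg", True),
    Criterion("documents_reassessment", False),
    Criterion("notifies_attending", True),
]
unsafe_run = [
    Criterion("checks_allergies", False, safety_critical=True),
    Criterion("orders_ecg", True),
]

print(dual_layer_reward(safe_run))
print(dual_layer_reward(unsafe_run))
```

Under this gating, a trajectory that satisfies most criteria but violates a single safety-critical one scores the same as one that satisfies nothing, so the reward cannot be raised by trading safety for task completion.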
