CodeClinic: Evaluating Automation of Coding Skills for Clinical Reasoning Agents
Timothy Ossowski, Xinchi Liu, Danyal Maqbool, Vaibhav Dhanuka, Sheng Zhang, Hoifung Poon, Majid Afshar, Tyler Bradshaw, Junjie Hu

TL;DR
CodeClinic introduces a benchmark and method for evaluating whether LLM agents can synthesize reusable clinical reasoning skills, reducing reliance on fixed toolboxes and improving reasoning consistency.
Contribution
The paper presents a new benchmark based on MIMIC-IV for assessing LLMs in clinical reasoning and proposes an autoformalization pipeline to generate verified Python skill libraries.
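To make "verified Python skill library" concrete, here is a minimal sketch of what a reusable, composable clinical skill could look like. It is not the paper's pipeline or API. The names `Vitals`, `SKILLS`, and `register` are hypothetical, and the verification step is assumed to be a simple check against known input/output cases. The scoring rule is the standard qSOFA definition (one point each for respiratory rate >= 22/min, systolic blood pressure <= 100 mmHg, and Glasgow Coma Scale < 15).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Vitals:
    """Hypothetical container for one set of bedside observations."""
    respiratory_rate: float  # breaths per minute
    systolic_bp: float       # mmHg
    gcs: int                 # Glasgow Coma Scale, 3 to 15


def qsofa_score(v: Vitals) -> int:
    """Base skill: qSOFA, one point per criterion met."""
    return (
        int(v.respiratory_rate >= 22)
        + int(v.systolic_bp <= 100)
        + int(v.gcs < 15)
    )


def sepsis_risk_flag(v: Vitals, threshold: int = 2) -> bool:
    """Composed skill: reuses qsofa_score instead of re-deriving it.

    The threshold is a parameter so an institution-specific policy
    can override the conventional cutoff of 2.
    """
    return qsofa_score(v) >= threshold


# Hypothetical skill registry: a skill is admitted only if it passes its checks.
SKILLS: Dict[str, Callable] = {}


def register(name: str, fn: Callable, checks: List[Tuple[tuple, object]]) -> None:
    """Run fn on each (args, expected) case and store it only if all pass."""
    for args, expected in checks:
        actual = fn(*args)
        if actual != expected:
            raise ValueError(f"{name} failed check: {args} -> {actual}, expected {expected}")
    SKILLS[name] = fn


register("qsofa_score", qsofa_score, [
    ((Vitals(24, 95, 14),), 3),
    ((Vitals(16, 120, 15),), 0),
])
register("sepsis_risk_flag", sepsis_risk_flag, [
    ((Vitals(24, 95, 15),), True),
    ((Vitals(22, 110, 15),), False),
])
```

Once registered, an agent can call `SKILLS["sepsis_risk_flag"]` repeatedly across queries rather than regenerating the logic each time, which is the kind of reuse the benchmark is designed to evaluate.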
Findings
Libraries improve consistency over zero-shot code generation.
Token usage per query is reduced by up to 40%.
Benchmark covers complex multi-step clinical reasoning tasks.
Abstract
Clinical reasoning agents based on large language models (LLMs) aim to automate tasks such as intensive care unit (ICU) monitoring and patient state tracking from electronic health records (EHRs). Existing systems typically rely on manually curated clinical tools or skills for concepts such as sepsis detection and organ failure assessment. However, maintaining these tool libraries requires substantial expert effort, while zero-shot querying or code generation often produces inefficient and unreliable reasoning chains, especially under institution-specific clinical policies. We introduce CodeClinic, a benchmark built on MIMIC-IV for evaluating whether LLM agents can synthesize and compose reusable clinical skills instead of relying on fixed toolboxes. The benchmark contains two complementary tasks: longitudinal ICU surveillance and compositional information seeking. The longitudinal…
