# FHIR-AgentEval: A Modular Sandbox for Benchmarking Clinical LLM Agents with an Evaluation of Memory-Augmented Configurations

**Authors:** Youssef Mokssit, Kamalakkannan Ravi, Mengshu Nie, Junyoung Kim, Cong Liu

PMC · DOI: 10.21203/rs.3.rs-8746188/v1 · Research Square · 2026-02-09

## TL;DR

This paper introduces FHIR-AgentEval, an extensible sandbox of 43 modular tasks for benchmarking LLM agents on FHIR-based clinical workflows, and reports that long-term memory consistently improves task success.

## Contribution

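The contribution is twofold: FHIR-AgentEval, an extensible sandbox whose tasks run against a resettable FHIR server with deterministic validation, and an ablation study of five agent configurations that vary access to a FHIR R4 specifications server and to long-term memory.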

## Key findings

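- On held-out tasks, the best memory configuration improved task success by 9.1% over the baseline.
- Across four experimental settings, memory reduced strategic failures such as incorrect tool selection and resource-type confusion.
- The sandbox comprises 43 modular tasks covering realistic appointment management and genetic testing workflows.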

## Abstract

Healthcare data exchange increasingly relies on HL7 FHIR, but FHIR’s implementation complexity creates barriers for clinical workflows. Large language model (LLM) agents could bridge this gap by translating natural language requests into structured FHIR operations, yet their reliability remains unproven. We present FHIR-AgentEval, an extensible evaluation sandbox comprising 43 modular tasks for benchmarking LLM agents on realistic appointment management and genetic testing workflows. Each task executes against a resettable FHIR server with custom deterministic validation of both agent responses and resulting server state. We run an ablation study of five agent configurations, varying access to an on-demand FHIR R4 specifications server and long-term memory trained with or without specification grounding. Across four experimental settings, memory consistently improves task success and reduces strategic failures such as incorrect tool selection and resource-type confusion. On held-out tasks, the best memory configuration improves success by 9.1% over baseline, offering a potential pathway toward more robust clinical deployment.
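The validation design described in the abstract, where each task runs against a resettable server and is checked on both the agent's response and the resulting server state, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `InMemoryFhirServer`, `Task`, `run_task`, and both toy agents are hypothetical names, and a plain dictionary stands in for a real FHIR server. Only the `Appointment` resource type and its `booked` and `cancelled` status codes come from FHIR R4.

```python
from copy import deepcopy
from dataclasses import dataclass
from typing import Callable, Optional


class InMemoryFhirServer:
    """Toy stand-in for a resettable FHIR server (hypothetical, for illustration)."""

    def __init__(self, seed: dict):
        self._seed = deepcopy(seed)
        self.store: dict = {}
        self.reset()

    def reset(self) -> None:
        # Restore the seed data so every task run starts from the same state.
        self.store = deepcopy(self._seed)

    def read(self, resource_type: str, resource_id: str) -> Optional[dict]:
        resource = self.store.get(resource_type, {}).get(resource_id)
        return deepcopy(resource) if resource is not None else None

    def update(self, resource: dict) -> None:
        bucket = self.store.setdefault(resource["resourceType"], {})
        bucket[resource["id"]] = deepcopy(resource)


@dataclass
class Task:
    """One benchmark task with two deterministic checks (assumed structure)."""
    prompt: str
    check_response: Callable[[str], bool]
    check_state: Callable[[InMemoryFhirServer], bool]


def run_task(task: Task, agent: Callable, server: InMemoryFhirServer) -> bool:
    """Reset the server, run the agent, and require both checks to pass."""
    server.reset()
    response = agent(task.prompt, server)
    return task.check_response(response) and task.check_state(server)


def careful_agent(prompt: str, server: InMemoryFhirServer) -> str:
    # Performs the requested update, then reports it.
    appointment = server.read("Appointment", "appt-1")
    appointment["status"] = "cancelled"
    server.update(appointment)
    return "Appointment appt-1 has been cancelled."


def careless_agent(prompt: str, server: InMemoryFhirServer) -> str:
    # Claims success without touching the server.
    return "Appointment appt-1 has been cancelled."


server = InMemoryFhirServer(
    {"Appointment": {"appt-1": {"resourceType": "Appointment",
                                "id": "appt-1",
                                "status": "booked"}}}
)

cancel_task = Task(
    prompt="Cancel appointment appt-1.",
    check_response=lambda text: "cancelled" in text.lower(),
    check_state=lambda s: s.read("Appointment", "appt-1")["status"] == "cancelled",
)

print(run_task(cancel_task, careful_agent, server))   # True
print(run_task(cancel_task, careless_agent, server))  # False
```

The second agent produces an identical reply but leaves the appointment `booked`, so it fails only the state check. That is the case a response-only evaluation would miss, and the reset before each run keeps one agent's changes from leaking into the next.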

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12919212/full.md

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12919212/full.md

## References

24 references — full list in the complete paper: https://tomesphere.com/paper/PMC12919212/full.md

---
Source: https://tomesphere.com/paper/PMC12919212