# From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs

**Authors:** Jessica M. Lundin, Usman Nasir Nakakana, Guillaume Chabot-Couture

arXiv: 2508.20810 · 2026-05-18

## TL;DR

This paper introduces a graph-based evaluation framework that transforms structured clinical guidelines into a queryable knowledge graph and generates evaluation questions by graph traversal. The goal is domain-specific evaluation of language models that is comprehensive, contamination-resistant, and valid.

## Contribution

The authors develop a scalable evaluation harness that converts structured guidelines into a knowledge graph and instantiates domain-specific evaluation queries dynamically, in contrast to static, manually curated datasets.

## Key findings

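- Models perform well on symptom recognition but show lower accuracy on treatment protocols and clinical management decisions.
- Applied to the WHO IMCI guidelines, the framework generates clinically grounded multiple-choice questions spanning symptom recognition, treatment, severity classification, and follow-up care.
- Evaluation across five language models reveals systematic capability gaps.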

## Abstract

Rigorous evaluation of domain-specific language models requires benchmarks that are comprehensive, contamination-resistant, and maintainable. Static, manually curated datasets do not satisfy these properties. We present a graph-based evaluation harness that transforms structured clinical guidelines into a queryable knowledge graph and dynamically instantiates evaluation queries via graph traversal. The framework provides three guarantees: (1) complete coverage of guideline relationships; (2) surface-form contamination resistance through combinatorial variation; and (3) validity inherited from expert-authored graph structure. Applied to the WHO IMCI guidelines, the harness generates clinically grounded multiple-choice questions spanning symptom recognition, treatment, severity classification, and follow-up care. Evaluation across five language models reveals systematic capability gaps. Models perform well on symptom recognition but show lower accuracy on treatment protocols and clinical management decisions. The framework supports continuous regeneration of evaluation data as guidelines evolve and generalizes to domains with structured decision logic. This provides a scalable foundation for evaluation infrastructure.
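The idea of instantiating questions by graph traversal can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the triples are toy placeholders rather than IMCI content, and the names `TRIPLES`, `TEMPLATES`, `build_graph`, and `generate_questions` are hypothetical and do not come from the paper. The sketch emits one multiple-choice question per edge (coverage), draws distractors from other nodes reached by the same relation, and varies the stem template and option order (surface-form variation).

```python
import random
from collections import defaultdict

# Hypothetical toy triples (subject, relation, object); NOT actual guideline content.
TRIPLES = [
    ("condition_A", "has_symptom", "symptom_1"),
    ("condition_B", "has_symptom", "symptom_2"),
    ("condition_C", "has_symptom", "symptom_3"),
    ("condition_A", "treated_with", "treatment_1"),
    ("condition_B", "treated_with", "treatment_2"),
    ("condition_C", "treated_with", "treatment_3"),
]

# Hypothetical stem templates; several per relation give surface-form variation.
TEMPLATES = {
    "has_symptom": [
        "Which symptom is associated with {s}?",
        "A child presents with {s}. Which sign would you expect?",
    ],
    "treated_with": [
        "What is the recommended treatment for {s}?",
        "Which treatment should be given for {s}?",
    ],
}


def build_graph(triples):
    """Adjacency list: subject -> list of (relation, object) edges."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph


def generate_questions(graph, rng, n_distractors=2):
    """Emit one multiple-choice question per edge in the graph."""
    # Pool of all objects seen for each relation, used to draw distractors.
    pool = defaultdict(set)
    for edges in graph.values():
        for relation, obj in edges:
            pool[relation].add(obj)

    questions = []
    for subject, edges in graph.items():
        # Objects that are correct for this subject must never be distractors.
        correct = defaultdict(set)
        for relation, obj in edges:
            correct[relation].add(obj)

        for relation, obj in edges:
            candidates = sorted(pool[relation] - correct[relation])
            k = min(n_distractors, len(candidates))
            options = rng.sample(candidates, k) + [obj]
            rng.shuffle(options)  # vary the position of the correct answer
            stem = rng.choice(TEMPLATES[relation]).format(s=subject)
            questions.append({
                "stem": stem,
                "options": options,
                "answer": obj,
                "edge": (subject, relation, obj),
            })
    return questions


questions = generate_questions(build_graph(TRIPLES), random.Random(0))
```

Because every question is derived from an edge, regenerating with a different seed changes the wording and option order while the covered relationships stay the same. The number of distinct surface forms per edge grows with the number of templates, distractor subsets, and option orderings.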

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20810/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20810/full.md

## References

19 references — full list in the complete paper: https://tomesphere.com/paper/2508.20810/full.md

---
Source: https://tomesphere.com/paper/2508.20810