Retrieval-Augmented Clinical Benchmarking for Contextual Model Testing in Kenyan Primary Care: A Methodology Paper

Fred Mutisya (1,2), Shikoh Gitau (1), Christine Syovata (2), Diana Oigara (2), Ibrahim Matende (2), Muna Aden (2), Munira Ali (2), Ryan Nyotu (2), Diana Marion (2), Job Nyangena (2), Nasubo Ongoma (1), Keith Mbae (1), Elizabeth Wamicha (1), Eric Mibuari (1), Jean Philbert Nsengemana (3), Talkmore Chidede (4) ((1) Qhala, Nairobi, Kenya; (2) Kenya Medical Association, Nairobi, Kenya; (3) Africa CDC; (4) AfCFTA)

arXiv:2507.14615·cs.CL·July 22, 2025

TL;DR

This paper introduces a methodology for creating a Kenyan-specific clinical benchmark dataset using retrieval-augmented generation to evaluate LLMs in local primary care settings, highlighting performance gaps and ensuring cultural relevance.

Contribution

It presents a novel, guideline-driven framework for developing localized clinical benchmarks and evaluation metrics for LLMs in African healthcare contexts.

Findings

01

LLMs perform significantly worse on Kenyan medical content than on US benchmarks.

02

The dataset includes thousands of question-answer pairs aligned with local guidelines.

03

New evaluation metrics assess reasoning, safety, and adaptability in clinical scenarios.

Abstract

Large Language Models (LLMs) hold promise for improving healthcare access in low-resource settings, but their effectiveness in African primary care remains underexplored. We present a methodology for creating a benchmark dataset and evaluation framework focused on Kenyan Level 2 and 3 clinical care. Our approach uses retrieval-augmented generation (RAG) to ground clinical questions in Kenya's national guidelines, ensuring alignment with local standards. These guidelines were digitized, chunked, and indexed for semantic retrieval. Gemini Flash 2.0 Lite was then prompted with guideline excerpts to generate realistic clinical scenarios, multiple-choice questions, and rationale-based answers in English and Swahili. Kenyan physicians co-created and refined the dataset, and a blinded expert review process ensured clinical accuracy, clarity, and cultural appropriateness. The resulting Alama…
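The chunk, index, retrieve, and prompt steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the chunk sizes, the function names (`chunk_text`, `retrieve`, `build_prompt`), the prompt wording, and the sample text are all hypothetical, and a bag-of-words cosine similarity stands in for whatever embedding-based semantic index the paper actually uses. The call to the generating model is deliberately left out.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40, overlap=10):
    """Split text into overlapping word windows (sizes are assumptions)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def vectorize(text):
    """Term counts; a simple stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = vectorize(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:k]


def build_prompt(topic, excerpts, language="English"):
    """Assemble a generation prompt grounded in retrieved excerpts."""
    joined = "\n---\n".join(excerpts)
    return (
        "Using only the guideline excerpts below, write one clinical scenario, "
        "one multiple-choice question, and a rationale-based answer "
        f"in {language}.\n"
        f"Topic: {topic}\n"
        f"Excerpts:\n{joined}\n"
    )


# Placeholder chunks, NOT taken from any real guideline.
SAMPLE_CHUNKS = [
    "placeholder section on fever assessment in children",
    "placeholder section on wound cleaning and dressing",
    "placeholder section on record keeping at the facility",
]

top = retrieve("fever in children", SAMPLE_CHUNKS, k=1)
prompt = build_prompt("fever in children", top, language="Swahili")
```

In a real setup the resulting `prompt` would be sent to the generating model, and the output would then go to physician review as the abstract describes.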
