Retrieve, Then Classify: Corpus-Grounded Automation of Clinical Value Set Authoring

Sumit Mukherjee; Juan Shu; Nairwita Mazumder; Tate Kernell; Celena Wheeler; Shannon Hastings; Chris Sidey-Gibbons

arXiv:2604.14616·cs.CL·April 17, 2026

TL;DR

This paper introduces Retrieval-Augmented Set Completion (RASC), which authors clinical value sets by retrieving similar existing value sets to form a candidate pool and then classifying each candidate code. It reports higher accuracy than prompting large language models to generate the codes directly.

Contribution

The paper presents RASC, a retrieval-augmented approach to clinical value set authoring, benchmarks it against baseline models, and releases the code as open source.

Findings

1. RASC achieves higher AUROC and F1 scores than baseline models.

2. Retrieval reduces irrelevant candidate codes from ~12 to ~3-4.

3. Performance gap widens with larger value sets, favoring RASC.

Abstract

Clinical value set authoring -- the task of identifying all codes in a standardized vocabulary that define a clinical concept -- is a recurring bottleneck in clinical quality measurement and phenotyping. A natural approach is to prompt a large language model (LLM) to generate the required codes directly, but structured clinical vocabularies are large, version-controlled, and not reliably memorized during pretraining. We propose Retrieval-Augmented Set Completion (RASC): retrieve the $K$ most similar existing value sets from a curated corpus to form a candidate pool, then apply a classifier to each candidate code. Theoretically, retrieve-and-select can reduce statistical complexity by shrinking the effective output space from the full vocabulary to a much smaller retrieved candidate pool. We demonstrate the utility of RASC on 11,803 publicly available VSAC value sets, constructing the…
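The retrieve-then-classify idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper or the mukhes3/RASC repository: the toy corpus, the placeholder code identifiers, the token-overlap (Jaccard) similarity used for retrieval, and the frequency-vote rule that stands in for the per-code classifier.

```python
from collections import Counter

# Hypothetical toy corpus: titles and code identifiers are placeholders,
# not real VSAC value sets or real vocabulary codes.
CORPUS = [
    {"title": "type 2 diabetes mellitus", "codes": ["A1", "A2", "A3"]},
    {"title": "diabetes mellitus diagnosis", "codes": ["A1", "A2", "B1"]},
    {"title": "essential hypertension", "codes": ["H1", "H2"]},
    {"title": "diabetes type 1", "codes": ["A1", "C1"]},
]


def tokens(text: str) -> set:
    """Lowercased whitespace tokens of a title."""
    return set(text.lower().split())


def jaccard(a: set, b: set) -> float:
    """Token-overlap similarity (assumed stand-in for the paper's retriever)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query: str, corpus: list, k: int) -> list:
    """Return the k value sets whose titles are most similar to the query."""
    q = tokens(query)
    ranked = sorted(corpus, key=lambda vs: jaccard(q, tokens(vs["title"])), reverse=True)
    return ranked[:k]


def candidate_pool(retrieved: list) -> Counter:
    """Union of codes from the retrieved sets, with how many sets contain each."""
    counts = Counter()
    for vs in retrieved:
        counts.update(set(vs["codes"]))
    return counts


def classify(counts: Counter, n_retrieved: int, threshold: float) -> set:
    """Keep a candidate if enough retrieved sets contain it.

    This vote is only a placeholder for a trained per-code classifier.
    """
    return {code for code, n in counts.items() if n / n_retrieved >= threshold}


def rasc_sketch(query: str, corpus: list, k: int = 3, threshold: float = 0.5) -> set:
    """Retrieve k similar value sets, pool their codes, classify each candidate."""
    retrieved = retrieve(query, corpus, k)
    counts = candidate_pool(retrieved)
    return classify(counts, len(retrieved), threshold)


if __name__ == "__main__":
    print(sorted(rasc_sketch("type 2 diabetes", CORPUS)))
```

The point of the sketch is the shape of the pipeline: the selection step only ever sees codes from the retrieved pool, so its decision space is the pool rather than the full vocabulary. A real system would presumably replace both the similarity function and the vote with learned components.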

Code & Models

Repositories

mukhes3/RASC (GitHub)
