TL;DR
This paper introduces Retrieval-Augmented Set Completion (RASC), a method for generating clinical value set codes: it retrieves similar existing value sets to form a candidate pool, then applies a classifier to each candidate code. RASC is more accurate than prompting large language models to generate the codes directly.
Contribution
The paper presents RASC, a retrieval-augmented approach to clinical code set generation, benchmarks it extensively, and releases the code as open source.
Findings
RASC achieves higher AUROC and F1 scores than baseline models.
Retrieval reduces irrelevant candidate codes from ~12 to ~3-4.
Performance gap widens with larger value sets, favoring RASC.
Abstract
Clinical value set authoring -- the task of identifying all codes in a standardized vocabulary that define a clinical concept -- is a recurring bottleneck in clinical quality measurement and phenotyping. A natural approach is to prompt a large language model (LLM) to generate the required codes directly, but structured clinical vocabularies are large, version-controlled, and not reliably memorized during pretraining. We propose Retrieval-Augmented Set Completion (RASC): retrieve the most similar existing value sets from a curated corpus to form a candidate pool, then apply a classifier to each candidate code. Theoretically, retrieve-and-select can reduce statistical complexity by shrinking the effective output space from the full vocabulary to a much smaller retrieved candidate pool. We demonstrate the utility of RASC on 11,803 publicly available VSAC value sets, constructing the…
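The retrieve-then-select idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy corpus, the made-up code identifiers, the token-level Jaccard retriever, and the similarity-weighted vote that stands in for the learned per-code classifier. The function names `retrieve` and `complete_set` are hypothetical.

```python
from collections import defaultdict

# Toy corpus (hypothetical): value-set titles mapped to made-up code identifiers.
CORPUS = {
    "type 2 diabetes mellitus": {"C1", "C2", "C3"},
    "diabetes mellitus": {"C1", "C2", "C4"},
    "essential hypertension": {"C7", "C8"},
    "diabetic retinopathy": {"C2", "C9"},
}


def jaccard(a, b):
    """Token-level Jaccard similarity between two titles (assumed retriever)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query, corpus, k=2):
    """Return the k most similar value sets as (title, similarity) pairs."""
    scored = [(title, jaccard(query, title)) for title in corpus]
    # Sort by descending similarity, breaking ties alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


def complete_set(query, corpus, k=2, threshold=0.5):
    """Build a candidate pool from retrieved sets, then keep or drop each code.

    The similarity-weighted vote below is only a stand-in for a trained
    per-code classifier.
    """
    neighbours = retrieve(query, corpus, k)
    total = sum(sim for _, sim in neighbours)
    votes = defaultdict(float)
    for title, sim in neighbours:
        for code in corpus[title]:
            votes[code] += sim
    pool = set(votes)
    if total == 0:
        # No retrieved set resembles the query, so select nothing.
        return pool, set()
    selected = {code for code, v in votes.items() if v / total >= threshold}
    return pool, selected


pool, selected = complete_set("diabetes mellitus type 2", CORPUS)
print(sorted(pool))      # candidate pool drawn from the retrieved value sets
print(sorted(selected))  # codes kept by the stand-in classifier
```

The point of the sketch is the shape of the pipeline: the selection step only ever scores codes in the retrieved pool, never the full vocabulary, which is the reduction in output space the abstract refers to.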
