CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications

Victoria Blake; Jamie Novak; Mathew Miller; Sze-yuan Ooi; Blanca Gallego

arXiv:2602.17949·cs.CL·May 15, 2026

TL;DR

CUICurate is a graph-based framework that automates the creation of comprehensive UMLS concept sets for clinical NLP, outperforming manual curation in size and completeness.

Contribution

The paper introduces a GraphRAG approach that combines knowledge graph embeddings with large language models for scalable, accurate concept set curation.

Findings

01 CUICurate produced larger, more complete concept sets than manual benchmarks.

02 GPT-5 achieved at least 95% recall of gold-standard CUIs across concepts.

03 The framework was cost-effective, stable, and applicable to clinical NLP tasks.
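
The recall figure in finding 02 can be read as the share of gold-standard CUIs that the curated set recovers, checked per concept. The sketch below assumes that reading; the exact metric definition is not given in this excerpt, and the function names, concept labels, and CUIs are invented for illustration.

```python
from typing import Dict, Set


def recall(predicted: Set[str], gold: Set[str]) -> float:
    """Fraction of gold-standard CUIs that also appear in the predicted set."""
    if not gold:
        raise ValueError("gold set must not be empty")
    return len(predicted & gold) / len(gold)


def min_recall(predicted: Dict[str, Set[str]], gold: Dict[str, Set[str]]) -> float:
    """Lowest per-concept recall; concepts missing from `predicted` score 0."""
    return min(recall(predicted.get(name, set()), cuis) for name, cuis in gold.items())


# Hypothetical concept sets (these CUIs are placeholders, not real UMLS entries).
gold_sets = {
    "concept_a": {"C0000001", "C0000002", "C0000003", "C0000004"},
    "concept_b": {"C0000005", "C0000006"},
}
predicted_sets = {
    "concept_a": {"C0000001", "C0000002", "C0000003", "C0000009"},
    "concept_b": {"C0000005", "C0000006", "C0000007"},
}

worst = min_recall(predicted_sets, gold_sets)
print(f"lowest per-concept recall: {worst:.2f}")
```

A claim of "at least 95% recall across concepts" would correspond to `min_recall(...) >= 0.95` under this assumed definition. Extra predicted CUIs do not lower recall, which is why a larger curated set can still score well against a smaller manual benchmark.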

Abstract

Background: Clinical named entity recognition tools commonly map free text to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs). For many downstream tasks, however, the clinically meaningful unit is not a single CUI but a concept set comprising related synonyms, subtypes, and associated concepts. Constructing these sets is labour-intensive, inconsistently performed, and poorly supported by existing tools. Methods: We present CUICurate, a graph-based retrieval-augmented generation (GraphRAG) framework for automated UMLS concept set curation. A UMLS knowledge graph (KG) was constructed and embedded for semantic retrieval. Candidate CUIs were retrieved using graph-based expansion and then filtered and classified using large language models (GPT-5 and Qwen3-32B). The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated…
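
The Methods portion of the abstract describes three stages: semantic retrieval over an embedded UMLS knowledge graph, graph-based expansion of the candidates, and LLM filtering. A minimal Python sketch of that flow follows, under stated assumptions. Every CUI, vector, edge, and function name is hypothetical; cosine similarity and hop-based neighbour expansion are assumed stand-ins, since this excerpt does not specify the retrieval details; and the `keep` callback is a placeholder for the GPT-5 or Qwen3-32B classification step.

```python
import math
from typing import Callable, Dict, List, Set

# Toy knowledge graph: invented CUIs, 2-d embeddings, and undirected edges.
EMBEDDINGS: Dict[str, List[float]] = {
    "C0000001": [1.0, 0.0],
    "C0000002": [0.9, 0.1],
    "C0000003": [0.0, 1.0],
    "C0000004": [0.1, 0.9],
}
EDGES: Dict[str, Set[str]] = {
    "C0000001": {"C0000002"},
    "C0000002": {"C0000001", "C0000004"},
    "C0000003": {"C0000004"},
    "C0000004": {"C0000002", "C0000003"},
}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_seeds(query: List[float], embeddings: Dict[str, List[float]], top_k: int) -> List[str]:
    """Return the top_k CUIs whose embeddings are most similar to the query."""
    ranked = sorted(embeddings, key=lambda cui: cosine(query, embeddings[cui]), reverse=True)
    return ranked[:top_k]


def expand(seeds: List[str], edges: Dict[str, Set[str]], hops: int) -> Set[str]:
    """Add every CUI reachable from the seeds within `hops` edges."""
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        frontier = {n for cui in frontier for n in edges.get(cui, set())} - seen
        seen |= frontier
    return seen


def filter_candidates(candidates: Set[str], keep: Callable[[str], bool]) -> Set[str]:
    """Keep only candidates accepted by `keep` (a stand-in for an LLM judgement)."""
    return {cui for cui in candidates if keep(cui)}


seeds = retrieve_seeds([1.0, 0.0], EMBEDDINGS, top_k=1)
candidates = expand(seeds, EDGES, hops=2)
concept_set = filter_candidates(candidates, keep=lambda cui: cui != "C0000004")
print(sorted(concept_set))
```

Expansion deliberately over-generates (here it pulls in a CUI two hops from the seed), and the filter step is what trims the candidates back to a usable concept set. In a real system the `keep` callback would wrap a model call with the concept definition and the candidate's UMLS metadata.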
