CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications
Victoria Blake, Jamie Novak, Mathew Miller, Sze-yuan Ooi, Blanca Gallego

TL;DR
CUICurate is a graph-based framework that automates the creation of comprehensive UMLS concept sets for clinical NLP, outperforming manual curation in size and completeness.
Contribution
It introduces a novel GraphRAG approach combining knowledge graph embeddings and large language models for scalable, accurate concept set curation.
Findings
CUICurate produced larger, more complete concept sets than manual benchmarks.
GPT-5 achieved at least 95% recall of gold-standard CUIs across concepts.
The framework was cost-effective, stable, and applicable to clinical NLP tasks.
Abstract
Background: Clinical named entity recognition tools commonly map free text to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs). For many downstream tasks, however, the clinically meaningful unit is not a single CUI but a concept set comprising related synonyms, subtypes, and associated concepts. Constructing these sets is labour-intensive, inconsistently performed, and poorly supported by existing tools. Methods We present CUICurate, a graph-based retrieval-augmented generation (GraphRAG) framework for automated UMLS concept set curation. A UMLS knowledge graph (KG) was constructed and embedded for semantic retrieval. Candidate CUIs were retrieved using graph-based expansion and then filtered and classified using large language models (GPT-5 and Qwen3-32B). The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
