Large language models as medical code selectors: a benchmark using the International Classification of Primary Care
Vinicius Anjos de Almeida, Vinicius de Camargo, Raquel Gómez-Bravo, Kees van Boven, Egbert van der Haring, Marcelo Finger, Luis Fernandez Lopez

TL;DR
This study evaluates how well large language models can assign ICPC-2 medical codes to Brazilian Portuguese clinical expressions and finds that most models perform strongly (28 of 33 exceed an F1-score of 0.8), with further gains when the retriever is optimized.
Contribution
The paper introduces a benchmark for evaluating LLMs in assigning ICPC-2 codes using a domain-specific search engine and annotated clinical expressions.
Findings
Twenty-eight models achieved an F1-score above 0.8, with ten exceeding 0.85.
Retriever optimization improved model performance by up to 4 points.
Smaller models struggled with formatting and input length limitations.
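The formatting finding can be made concrete with a small format-adherence check. The sketch below is an assumption for illustration, not the paper's evaluation code: it supposes the model is asked to reply with a single ICPC-2 code (one chapter letter followed by two digits) drawn from the retrieved candidates, and it treats any other reply as a formatting failure. The function name `extract_code` is hypothetical.

```python
import re

# ICPC-2 rubric codes: one chapter letter followed by two digits (e.g. R05).
# Chapter letters are A, B, D, F, H, K, L, N, P, R, S, T, U, W, X, Y, Z.
CODE_PATTERN = re.compile(r"\b[ABDFHKLNPR-UW-Z]\d{2}\b")


def extract_code(reply, candidate_codes):
    """Return the single candidate code named in the reply, else None.

    Assumption: a reply adheres to the format only if it mentions exactly
    one distinct code and that code is among the retrieved candidates.
    """
    found = set(CODE_PATTERN.findall(reply.upper()))
    valid = found & set(candidate_codes)
    if len(found) == 1 and len(valid) == 1:
        return next(iter(valid))
    return None


candidates = {"R05", "A03"}
clean = extract_code("R05", candidates)                      # well-formed reply
wordy = extract_code("The best match is r05.", candidates)   # code inside prose
ambiguous = extract_code("R05 or A03", candidates)           # two codes named
off_list = extract_code("Z99", candidates)                   # not a candidate
```

Under these assumptions, replies that name two codes or a code outside the candidate list are rejected rather than guessed at, which keeps formatting failures separate from wrong selections.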
Abstract
Medical coding structures health-care data for research, quality monitoring, and policy. This study assesses the potential of large language models (LLMs) to assign International Classification of Primary Care, 2nd edition (ICPC-2) codes using the output of a domain-specific search engine. A dataset of 437 Brazilian Portuguese clinical expressions, each annotated with ICPC-2 codes, was used. A semantic search engine (OpenAI’s text-embedding-3-large) retrieved candidates from 73 563 labeled concepts. Thirty-three LLMs were prompted with each query and retrieved results to select the best-matching ICPC-2 code. Performance was evaluated using F1-score, along with token usage, cost, response time, and format adherence. Twenty-eight models achieved an F1-score > 0.8; 10 exceeded 0.85. Top performers included gpt-4.5-preview, o3, and gemini-2.5-pro. Retriever optimization can improve performance…
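The retrieve-then-select setup described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three-dimensional vectors are made-up stand-ins for real embeddings, the prompt wording is invented, and the F1 computation assumes one predicted code per query, a set of accepted gold codes, and micro-averaging in which a wrong code counts as both a false positive and a false negative while a missing answer counts only as a false negative. All function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, concepts, k=2):
    """Rank (code, label, vector) concepts by similarity; keep the top k."""
    ranked = sorted(concepts, key=lambda c: cosine(query_vec, c[2]), reverse=True)
    return [(code, label) for code, label, _ in ranked[:k]]


def build_prompt(query, candidates):
    """Assemble a selection prompt (wording is an assumption)."""
    lines = [f"{i + 1}. {code} - {label}" for i, (code, label) in enumerate(candidates)]
    return (
        f"Clinical expression: {query}\n"
        "Candidates:\n" + "\n".join(lines) + "\n"
        "Answer with the single best-matching ICPC-2 code."
    )


def micro_f1(predictions, gold):
    """Micro-averaged F1 under the assumptions stated in the lead-in."""
    tp = fp = fn = 0
    for pred, accepted in zip(predictions, gold):
        if pred is None:          # no usable answer: missed the gold code
            fn += 1
        elif pred in accepted:    # correct code selected
            tp += 1
        else:                     # wrong code: spurious and missed
            fp += 1
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy concept index with made-up vectors (not real embeddings).
concepts = [
    ("R05", "Cough", [1.0, 0.0, 0.0]),
    ("A03", "Fever", [0.0, 1.0, 0.0]),
    ("N01", "Headache", [0.0, 0.0, 1.0]),
]
top = retrieve([0.9, 0.1, 0.0], concepts, k=2)
prompt = build_prompt("tosse", top)

# Toy predictions: two correct, one missing, one wrong.
score = micro_f1(
    ["A03", "R05", None, "K86"],
    [{"A03"}, {"R05"}, {"D01"}, {"K87"}],
)
```

In a real pipeline the vectors would come from an embedding model and the prompt would be sent to an LLM; both steps are left out here so the sketch runs without network access.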
Taxonomy
Topics: Machine Learning in Healthcare · Medical Coding and Health Information · Biomedical Text Mining and Ontologies
