Large language models as medical code selectors: a benchmark using the International Classification of Primary Care

Vinicius Anjos de Almeida; Vinicius de Camargo; Raquel Gómez-Bravo; Kees van Boven; Egbert van der Haring; Marcelo Finger; Luis Fernandez Lopez

PMC · DOI: 10.1093/jamiaopen/ooag017 · February 13, 2026

Open Access

TL;DR

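This study evaluates how well large language models can assign ICPC-2 medical codes to Brazilian Portuguese clinical expressions, given candidates retrieved by a semantic search engine. Twenty-eight of the 33 models tested achieved an F1-score above 0.8, and optimizing the retriever improved results further.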

Contribution

The paper introduces a benchmark for evaluating LLMs in assigning ICPC-2 codes using a domain-specific search engine and annotated clinical expressions.

Findings

01

Twenty-eight models achieved an F1-score above 0.8, with ten exceeding 0.85.

02

Retriever optimization improved model performance by up to 4 points.

03

Smaller models struggled with formatting and input length limitations.

Abstract

Medical coding structures health-care data for research, quality monitoring, and policy. This study assesses the potential of large language models (LLMs) to assign International Classification of Primary Care, 2nd edition (ICPC-2) codes using the output of a domain-specific search engine. A dataset of 437 Brazilian Portuguese clinical expressions, each annotated with ICPC-2 codes, was used. A semantic search engine (OpenAI’s text-embedding-3-large) retrieved candidates from 73 563 labeled concepts. Thirty-three LLMs were prompted with each query and retrieved results to select the best-matching ICPC-2 code. Performance was evaluated using F1-score, along with token usage, cost, response time, and format adherence. Twenty-eight models achieved F1-score > 0.8; 10 exceeded 0.85. Top performers included gpt-4.5-preview, o3, and gemini-2.5-pro. Retriever optimization can improve performance…
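The retrieve-then-select setup described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the three-dimensional vectors stand in for real embeddings, `select_code` is a hypothetical placeholder for the LLM call (it simply takes the top-ranked candidate), the concept labels are invented, and the micro-averaged F1 over code sets is one plausible reading of the metric, not necessarily the definition the authors used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, concepts, k=2):
    """Return the k labeled concepts most similar to the query vector."""
    ranked = sorted(concepts, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:k]

def select_code(candidates):
    """Hypothetical stand-in for the LLM: pick the top-ranked candidate's code."""
    return {candidates[0]["code"]} if candidates else set()

def micro_f1(predictions, gold):
    """Micro-averaged F1 over predicted and gold code sets (assumed metric)."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, gold):
        tp += len(pred & ref)
        fp += len(pred - ref)
        fn += len(ref - pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy concept index: ICPC-2 codes with invented labels and toy vectors.
CONCEPTS = [
    {"code": "T90", "label": "diabetes tipo 2", "vec": [1.0, 0.1, 0.0]},
    {"code": "K86", "label": "hipertensão", "vec": [0.0, 1.0, 0.1]},
    {"code": "R05", "label": "tosse", "vec": [0.1, 0.0, 1.0]},
]

# Toy queries: (query vector, gold code set). The third is deliberately
# ambiguous, so the top-ranked candidate is not the gold code.
QUERIES = [
    ([0.9, 0.2, 0.0], {"T90"}),
    ([0.1, 0.9, 0.0], {"K86"}),
    ([0.6, 0.0, 0.5], {"R05"}),
]

predictions = [select_code(retrieve(vec, CONCEPTS)) for vec, _ in QUERIES]
gold = [ref for _, ref in QUERIES]
score = micro_f1(predictions, gold)
print(f"micro-F1 = {score:.3f}")
```

In the third query the gold code is retrieved but ranked second. A selector that reads the query and all candidates could still recover it, while the placeholder cannot, which is the gap the LLM step is meant to close.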

Taxonomy

Topics: Machine Learning in Healthcare · Medical Coding and Health Information · Biomedical Text Mining and Ontologies