# Large language models as medical code selectors: a benchmark using the International Classification of Primary Care

**Authors:** Vinicius Anjos de Almeida, Vinicius de Camargo, Raquel Gómez-Bravo, Kees van Boven, Egbert van der Haring, Marcelo Finger, Luis Fernandez Lopez

PMC · DOI: 10.1093/jamiaopen/ooag017 · 2026-02-13

## TL;DR

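This study benchmarks 33 large language models on selecting ICPC-2 codes for 437 Brazilian Portuguese clinical expressions, using candidates returned by a semantic search engine. Twenty-eight models reach an F1-score above 0.8 without fine-tuning, and optimizing the retriever can add up to 4 points.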

## Contribution

The paper introduces a benchmark for evaluating how well LLMs select ICPC-2 codes for annotated clinical expressions, given candidates retrieved by a domain-specific search engine.

## Key findings

- Twenty-eight models achieved an F1-score above 0.8, with ten exceeding 0.85.
- Retriever optimization can improve performance by up to 4 points.
- Smaller models (<3B parameters) struggled with formatting and input length limitations.

## Abstract

Medical coding structures health-care data for research, quality monitoring, and policy. This study assesses the potential of large language models (LLMs) to assign International Classification of Primary Care, 2nd edition (ICPC-2) codes using the output of a domain-specific search engine.

A dataset of 437 Brazilian Portuguese clinical expressions, each annotated with ICPC-2 codes, was used. A semantic search engine (OpenAI’s text-embedding-3-large) retrieved candidates from 73 563 labeled concepts. Thirty-three LLMs were prompted with each query and retrieved results to select the best-matching ICPC-2 code. Performance was evaluated using F1-score, along with token usage, cost, response time, and format adherence.
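The retrieve-then-select setup described above can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the three-dimensional vectors stand in for real embeddings, the `selector` callable stands in for an LLM call, all function names are hypothetical, and micro-averaged F1 is an assumption, since the abstract does not say which averaging was used.

```python
import math
from typing import Callable, List, Optional, Sequence, Set, Tuple

Concept = Tuple[str, str, Sequence[float]]  # (label, ICPC-2 code, embedding)
Candidate = Tuple[str, str]                 # (label, ICPC-2 code)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec: Sequence[float], concepts: List[Concept], k: int = 10) -> List[Candidate]:
    """Return the k labeled concepts most similar to the query embedding."""
    ranked = sorted(concepts, key=lambda c: cosine(query_vec, c[2]), reverse=True)
    return [(label, code) for label, code, _ in ranked[:k]]


def select_code(
    query: str,
    candidates: List[Candidate],
    selector: Callable[[str, List[Candidate]], str],
) -> Optional[str]:
    """Ask the selector (an LLM stand-in) for one code; reject codes that were not retrieved."""
    choice = selector(query, candidates).strip()
    allowed = {code for _, code in candidates}
    return choice if choice in allowed else None


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over per-expression code sets (averaging scheme is an assumption)."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy concept index; the vectors are made up for illustration.
CONCEPTS: List[Concept] = [
    ("pneumonia", "R81", (1.0, 0.0, 0.0)),
    ("febre", "A03", (0.0, 1.0, 0.0)),
    ("dor no peito", "A11", (0.0, 0.0, 1.0)),
]


def first_candidate(query: str, candidates: List[Candidate]) -> str:
    """Trivial selector that always picks the top-ranked candidate."""
    return candidates[0][1]


candidates = retrieve((0.9, 0.1, 0.0), CONCEPTS, k=2)
predicted = select_code("pneumonia", candidates, first_candidate)
```

Rejecting any answer that is not among the retrieved candidates is one simple way to guard against hallucinated codes. It also means the selector can never recover a correct code the retriever missed, which is one reason retriever quality bounds overall performance in a setup like this.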

Twenty-eight models achieved an F1-score > 0.8; 10 exceeded 0.85. Top performers included gpt-4.5-preview, o3, and gemini-2.5-pro. Retriever optimization can improve performance by up to 4 points. Most models returned valid codes in the expected format, with reduced hallucinations. Smaller models (<3B parameters) struggled with formatting and input length.
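A format-adherence check of the kind mentioned above might look like the following minimal sketch. It assumes the model is expected to answer with a bare ICPC-2 code (a chapter letter followed by two digits, such as `R81`); the exact output format used in the paper is not given here, and `parse_code` is a hypothetical helper. The check is structural only and does not confirm that a code actually exists in ICPC-2.

```python
import re
from typing import Optional

# The 17 ICPC-2 chapter letters.
ICPC2_CHAPTERS = "ABDFHKLNPRSTUWXYZ"

# Chapter letter plus a two-digit number from 01 to 99.
CODE_PATTERN = re.compile(rf"[{ICPC2_CHAPTERS}](?!00)[0-9]{{2}}")


def parse_code(response: str) -> Optional[str]:
    """Return the code if the whole response is a well-formed ICPC-2 code, else None."""
    text = response.strip()
    return text if CODE_PATTERN.fullmatch(text) else None


print(parse_code(" R81\n"))             # well-formed after trimming whitespace
print(parse_code("The code is R81."))   # extra prose counts as a format failure
```

Using a strict full match makes verbose answers count as failures, which is a deliberate choice in this sketch; a more lenient evaluator could instead search for the first well-formed code inside the response.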

Large language models show strong potential for automating ICPC-2 coding, even without fine-tuning. This work offers a benchmark and highlights challenges, but findings are limited by dataset scope and setup. Broader, multilingual, end-to-end evaluations are needed for clinical validation.

## Full-text entities

- **Diseases:** pneumonia (MESH:D011014), chest pain (MESH:D002637), musculoskeletal pain (MESH:D059352), LLMs (MESH:D007806), fever (MESH:D005334), pulmonary embolism (MESH:D011655), Parkinson's disease (MESH:D010300), fracture (MESH:D050723), pain (MESH:D010146), injury (MESH:D014947), uveitic diseases (MESH:D004194), substance abuse disorder (MESH:D019966), penicillin allergic reaction (MESH:D004342), sepsis (MESH:D018805), heart-related pain (MESH:D000072716), ICD-10-CM (MESH:D008310), hallucination (MESH:D006212), COVID-19 (MESH:D000086382)
- **Chemicals:** DAPO (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12924630/full.md

---
Source: https://tomesphere.com/paper/PMC12924630