Large Language Models as Medical Codes Selectors: a benchmark using the International Classification of Primary Care

Vinicius Anjos de Almeida; Vinicius de Camargo; Raquel Gómez-Bravo; Egbert van der Haring; Kees van Boven; Marcelo Finger; Luis Fernandez Lopez

arXiv:2507.14681·cs.CL·November 4, 2025

Open Access

TL;DR

This study benchmarks large language models' ability to assign ICPC-2 medical codes from clinical expressions, demonstrating high accuracy and identifying key performance factors without model fine-tuning.

Contribution

It introduces a benchmark for LLMs in medical coding using a domain-specific search engine and evaluates multiple models' performance on Brazilian Portuguese clinical data.

Findings

01. Twenty-eight of the 33 models achieved an F1-score > 0.8

02. Top models include gpt-4.5-preview, o3, and gemini-2.5-pro

03. Retriever optimization can improve performance by up to 4 points

Abstract

Background: Medical coding structures healthcare data for research, quality monitoring, and policy. This study assesses the potential of large language models (LLMs) to assign ICPC-2 codes using the output of a domain-specific search engine.

Methods: A dataset of 437 Brazilian Portuguese clinical expressions, each annotated with ICPC-2 codes, was used. A semantic search engine (OpenAI's text-embedding-3-large) retrieved candidates from 73,563 labeled concepts. Thirty-three LLMs were prompted with each query and retrieved results to select the best-matching ICPC-2 code. Performance was evaluated using F1-score, along with token usage, cost, response time, and format adherence.

Results: Twenty-eight models achieved F1-score > 0.8; ten exceeded 0.85. Top performers included gpt-4.5-preview, o3, and gemini-2.5-pro. Retriever optimization can improve performance by up to 4 points. Most…
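The retrieve-then-select setup described in the Methods can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the three-dimensional vectors stand in for real embeddings, `select_code` is a hypothetical placeholder for the LLM prompt step, the concept table is toy data, and the F1 definition (unanswered queries count against recall only) is an assumption, since the abstract does not specify how F1 was computed.

```python
import math

# Toy concept table; vectors are illustrative stand-ins for real embeddings.
CONCEPTS = [
    {"label": "cough", "code": "R05", "vec": [1.0, 0.0, 0.0]},
    {"label": "fever", "code": "A03", "vec": [0.0, 1.0, 0.0]},
    {"label": "headache", "code": "N01", "vec": [0.0, 0.0, 1.0]},
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query_vec, concepts, k=2):
    """Return the k concepts most similar to the query vector."""
    ranked = sorted(concepts, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:k]

def select_code(candidates):
    """Hypothetical stand-in for the LLM step: take the top candidate, or abstain."""
    return candidates[0]["code"] if candidates else None

def f1_score(predictions, gold):
    """F1 for single-label predictions; None means the model gave no answer (assumed scheme)."""
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    correct = sum(1 for p, g in answered if p == g)
    precision = correct / len(answered) if answered else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    candidates = retrieve([0.9, 0.1, 0.0], CONCEPTS, k=2)
    print([c["code"] for c in candidates], select_code(candidates))
    print(f1_score(["R05", None, "N01"], ["R05", "A03", "N02"]))
```

In a real run, the query and concept vectors would come from an embedding model and `select_code` would wrap a prompt containing the query plus the retrieved candidates; the sketch only shows how the pieces fit together.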

Taxonomy

Topics: Medical Coding and Health Information · Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education