# ICD coding of death certificates with generative language models

**Authors:** Isabel Coutinho, Gonçalo M. Correia, Bruno Martins, Afonso Moreira, André Peralta-Santos

PMC · DOI: 10.1371/journal.pdig.0001245 · PLOS Digital Health · 2026-02-24

## TL;DR

This paper adapts a relatively small LLaMA generative language model to assign ICD codes to free-text descriptions in Portuguese death certificates, reaching a classification accuracy at least in line with a BERT encoder baseline.

## Contribution

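The paper frames ICD coding as a language generation problem and proposes several strategies for adapting a LLaMA model to it. In one of them, the model is trained with a language modeling objective only and is restricted by constrained decoding at inference time, rather than being fine-tuned for discriminative classification.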
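To make the idea of constrained decoding concrete, the sketch below restricts a greedy decoder to a fixed set of valid code strings using a prefix tree. It is a toy illustration, not the authors' implementation: the character-level vocabulary, the `END` marker, the `toy_scores` stand-in for a language model, and the four example codes are all assumptions made for this example. A real system would apply the same masking over the model's token vocabulary.

```python
from typing import Callable, Dict, List

END = "<end>"  # hypothetical end-of-code marker, not taken from the paper


def build_trie(codes: List[str]) -> dict:
    """Build a character-level prefix tree over the valid code strings."""
    root: dict = {}
    for code in codes:
        node = root
        for ch in code:
            node = node.setdefault(ch, {})
        node[END] = {}  # mark that a complete code ends here
    return root


def constrained_greedy_decode(
    score_fn: Callable[[str], Dict[str, float]], trie: dict
) -> str:
    """Greedy decoding in which each step may only pick a symbol the trie allows."""
    node, prefix = trie, ""
    while True:
        scores = score_fn(prefix)      # scores over the full vocabulary
        allowed = list(node.keys())    # symbols that keep the prefix valid
        best = max(allowed, key=lambda s: scores[s])
        if best == END:
            return prefix
        prefix += best
        node = node[best]


def toy_scores(prefix: str) -> Dict[str, float]:
    """Stand-in for a language model; it prefers to spell the invalid string 'I2X9'."""
    target = "I2X9"
    vocab = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789") + [END]
    scores = {s: 0.0 for s in vocab}
    if len(prefix) < len(target):
        scores[target[len(prefix)]] = 2.0
    else:
        scores[END] = 2.0
    scores["1"] += 1.0  # mild secondary preference, used when 'X' is masked out
    return scores


# Illustrative code strings written without the dot (assumed format).
VALID_CODES = ["I219", "I251", "J189", "C349"]

decoded = constrained_greedy_decode(toy_scores, build_trie(VALID_CODES))
print(decoded)  # I219
```

Although the unconstrained scorer would prefer an invalid string, the mask guarantees that the output is always a member of the valid code set, which is what lets a purely generative model act as a classifier.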

## Key findings

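- Generative language models achieved a classification accuracy at least in line with that of encoder models for ICD coding of death certificates.
- A single unified model can handle multiple related tasks, such as coding the underlying cause or the multiple causes contributing to a death.
- The approach has potential for real-life public health surveillance, either to obtain approximate mortality statistics or to produce draft codes that human coders refine, reducing coder workload.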

## Abstract

Although large language models can achieve remarkable results in most text generation tasks, these models have been less used in text classification problems, of which ICD coding of clinical documents is one example. In this work, we propose different strategies to adapt a LLaMA generative language model to the ICD coding task. In one such strategy, we only use a language modeling objective for training, followed by constrained decoding at inference time, rather than fine-tuning the model for discriminative classification. We specifically use free-text descriptions in Portuguese death certificates to train a relatively small LLaMA model for assigning ICD codes to the underlying cause of death, and we compare it against a BERT encoder model, which is typically used to address text classification tasks. Experiments show that generative language models can achieve strong results in ICD coding of death certificates, with a classification accuracy that is at least in line with the results obtained using encoder models. We thus demonstrate that language generation can be a suitable approach for ICD coding, allowing for multiple related tasks, such as coding the underlying or the multiple causes contributing to a death, to be performed with a single unified model.

The ICD coding system is a standardized way of classifying health conditions and external causes of injury or disease. Assigning these codes to causes of death is a critical task in public health, needed to monitor a population’s health and to conduct mortality and morbidity studies. However, manually coding death certificates is time-consuming and error-prone. Our work explores whether modern language models can help automate the ICD coding task effectively. We specifically frame ICD coding as a language generation problem, where human coders can interact with the model through a simple language prompt, allowing different tasks, such as coding the underlying cause or the multiple causes of death, to be performed with a single unified model. We achieved strong experimental results, showing that this solution indeed has the potential to be applied in a real-life public health surveillance scenario, either in a setting where only approximate mortality statistics are needed, or in a setting where human coders refine the automatically generated results, considerably alleviating the burden on public health professionals.
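The prompt-based framing described above can be sketched as follows: one function builds the text given to the model for either task, and another parses the generated continuation into codes. The instruction wording, the `Certificate:`/`Codes:` layout, the comma-separated output format, and the example certificate lines are all hypothetical; the summary does not specify the prompts the authors used.

```python
from typing import Dict, List

# Hypothetical task instructions; the paper's actual prompt wording is not given here.
TASK_INSTRUCTIONS: Dict[str, str] = {
    "underlying": "Assign the ICD code for the underlying cause of death.",
    "multiple": "Assign ICD codes for all causes contributing to the death.",
}


def build_prompt(task: str, certificate_lines: List[str]) -> str:
    """Combine a task instruction with the free-text lines of a death certificate."""
    if task not in TASK_INSTRUCTIONS:
        raise ValueError(f"unknown task: {task!r}")
    body = "\n".join(
        f"{i}. {line}" for i, line in enumerate(certificate_lines, start=1)
    )
    return f"{TASK_INSTRUCTIONS[task]}\nCertificate:\n{body}\nCodes:"


def parse_codes(generated: str) -> List[str]:
    """Split a generated continuation such as 'I219, J189' into individual codes."""
    return [tok.strip() for tok in generated.split(",") if tok.strip()]


# Invented example lines, for illustration only.
example_lines = ["Enfarte agudo do miocárdio", "Hipertensão arterial"]

prompt_underlying = build_prompt("underlying", example_lines)
prompt_multiple = build_prompt("multiple", example_lines)
print(prompt_underlying)
```

Because only the instruction changes between the two prompts, the same model weights can serve both tasks, and the generated continuation can be constrained to valid codes before it is parsed.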

## Full-text entities

- **Diseases:** diseases of the circulatory system (MESH:D012769), injury or disease (MESH:D004194), Influenza (MESH:D007251), respiratory diseases (MESH:D012140), ICD (MESH:D008310), Death (MESH:D003643), IRIS (MESH:C535535), ischemic heart diseases (MESH:D017202), COVID-19 (MESH:D000086382), neoplasms (MESH:D009369), Cerebrovascular diseases (MESH:D002561), pneumonia (MESH:D011014), overdose (MESH:D062787)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12931742/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12931742/full.md

## References

17 references — full list in the complete paper: https://tomesphere.com/paper/PMC12931742/full.md

---
Source: https://tomesphere.com/paper/PMC12931742