Guardians of the data: NER and LLMs for effective medical record anonymization in Brazilian Portuguese

Mauricio Schiezaro; Guilherme Rosa; Bruno Augusto Goulart Campos; Helio Pedrini

PMC · DOI: 10.3389/fpubh.2025.1717303 · January 5, 2026

Open Access

TL;DR

This paper introduces AnonyMed-BR, a new dataset for anonymizing medical records in Brazilian Portuguese, and evaluates extractive and generative models for effective anonymization.

Contribution

AnonyMed-BR is the first manually annotated anonymization dataset for Brazilian Portuguese medical texts.

Findings

01

Both extractive and generative models achieved F1 scores above 0.90 in anonymizing sensitive entities.

02

Including synthetic data improved model generalization, and task-specific fine-tuning outperformed biomedical pre-training.

03

The dataset and methodology enable privacy-preserving NLP research in Brazilian healthcare.
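The F1 scores in the first finding can be read with a small worked example. The sketch below assumes exact-match, entity-level scoring over `(start, end, label)` spans. This summary does not state the paper's actual evaluation protocol, so the matching rule, the `entity_f1` helper, and the label names are illustrative assumptions.

```python
from typing import Set, Tuple

# Hypothetical span type: (start offset, end offset, entity label).
Span = Tuple[int, int, str]


def entity_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """Exact-match, entity-level F1 between gold and predicted spans.

    Assumption: a prediction counts as correct only if its offsets and
    label both match a gold span exactly.
    """
    if not gold and not pred:
        return 1.0  # nothing to find and nothing predicted
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up spans: two of three gold entities are recovered exactly,
# and the third prediction has the wrong offsets.
gold = {(0, 10, "NAME"), (15, 25, "DATE"), (30, 40, "LOCATION")}
pred = {(0, 10, "NAME"), (15, 25, "DATE"), (42, 50, "LOCATION")}
score = entity_f1(gold, pred)
print(round(score, 3))
```

Under this rule a boundary error costs both precision and recall, so a score above 0.90 leaves little room for missed or misplaced entities.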

Abstract

The anonymization of medical records is essential to protect patient privacy while enabling the use of clinical data for research and Natural Language Processing (NLP) applications. However, for Brazilian Portuguese, the lack of publicly available and high-quality anonymized datasets limits progress in this area. In this study, we present AnonyMed-BR, a novel dataset of Brazilian medical records that includes both real and synthetic samples, manually annotated to identify personally identifiable information (PII) such as names, dates, locations, and healthcare identifiers. To benchmark our dataset and assess anonymization performance, we evaluate two anonymization strategies: (i) an extractive strategy based on Named Entity Recognition (NER) using BERT-based models, and (ii) a generative strategy using T5-based and GPT-based models to rewrite texts while masking sensitive entities. We…
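The extractive strategy can be illustrated by its final step: once a tagger has produced entity spans, each span is replaced by a placeholder. This minimal sketch assumes the spans are already available as character offsets. The `mask_entities` and `span_of` helpers, the `[LABEL]` placeholder format, the label names, and the sample record are all hypothetical and are not taken from AnonyMed-BR.

```python
from typing import List, Tuple

# Hypothetical span type: (start offset, end offset, entity label).
Span = Tuple[int, int, str]


def span_of(text: str, surface: str, label: str) -> Span:
    """Locate the first occurrence of `surface` (stand-in for an NER model)."""
    start = text.index(surface)
    return (start, start + len(surface), label)


def mask_entities(text: str, spans: List[Span]) -> str:
    """Replace each span with a [LABEL] placeholder, left to right."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # skip a span that overlaps one already masked
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Invented record; no real patient data.
record = "Paciente Maria Silva, atendida em 12/03/2021 em Campinas."
spans = [
    span_of(record, "Maria Silva", "NAME"),
    span_of(record, "12/03/2021", "DATE"),
    span_of(record, "Campinas", "LOCATION"),
]
masked = mask_entities(record, spans)
print(masked)
```

A generative model would instead emit the masked text directly, so a placeholder format like this one is a convenient common target for comparing the two strategies.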

Taxonomy

Topics: Machine Learning in Healthcare · Topic Modeling · Data Quality and Management