Guardians of the data: NER and LLMs for effective medical record anonymization in Brazilian Portuguese
Mauricio Schiezaro, Guilherme Rosa, Bruno Augusto Goulart Campos, Helio Pedrini

TL;DR
This paper introduces AnonyMed-BR, a new dataset for anonymizing medical records in Brazilian Portuguese, and evaluates extractive and generative models for effective anonymization.
Contribution
AnonyMed-BR is the first manually annotated anonymization dataset for Brazilian Portuguese medical texts.
Findings
Both extractive and generative models achieved F1 scores above 0.90 in anonymizing sensitive entities.
Including synthetic data improved model generalization, and task-specific fine-tuning outperformed biomedical pre-training.
The dataset and methodology enable privacy-preserving NLP research in Brazilian healthcare.
Abstract
The anonymization of medical records is essential to protect patient privacy while enabling the use of clinical data for research and Natural Language Processing (NLP) applications. However, for Brazilian Portuguese, the lack of publicly available and high-quality anonymized datasets limits progress in this area. In this study, we present AnonyMed-BR, a novel dataset of Brazilian medical records that includes both real and synthetic samples, manually annotated to identify personally identifiable information (PII) such as names, dates, locations, and healthcare identifiers. To benchmark our dataset and assess anonymization performance, we evaluate two anonymization strategies: (i) an extractive strategy based on Named Entity Recognition (NER) using BERT-based models, and (ii) a generative strategy using T5-based and GPT-based models to rewrite texts while masking sensitive entities. We…
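The extractive strategy described above can be illustrated with a minimal sketch: once a NER model has returned entity spans, anonymization reduces to replacing each span with a placeholder. The span format, the label names (`NAME`, `DATE`, `LOCATION`), the example record, and the function `mask_entities` are assumptions made for illustration; they are not taken from the paper or from AnonyMed-BR, and the spans here are located by string search as a stand-in for real model output.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, label), character offsets, end exclusive.
Span = Tuple[int, int, str]


def mask_entities(text: str, spans: List[Span]) -> str:
    """Replace each detected span with a bracketed label placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Skip a span that overlaps one already masked.
            continue
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(f"[{label}]")        # placeholder instead of the entity
        cursor = end
    pieces.append(text[cursor:])           # remaining text after the last entity
    return "".join(pieces)


# Invented example record; not drawn from any dataset.
record = "Paciente Maria Silva, atendida em 12/03/2021 em Campinas."

# Stand-in for NER output: locate each surface form by string search.
detected = [("Maria Silva", "NAME"), ("12/03/2021", "DATE"), ("Campinas", "LOCATION")]
spans = [(record.index(s), record.index(s) + len(s), label) for s, label in detected]

masked = mask_entities(record, spans)
print(masked)
```

A generative model, by contrast, would be asked to emit the masked text directly, so no span-to-text step like this one is needed; the trade-off is that its output has to be checked for unintended rewrites of non-sensitive content.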
Taxonomy
Topics: Machine Learning in Healthcare · Topic Modeling · Data Quality and Management
