# A Hybrid AI-based and Rule-based Approach to DICOM De-identification: A Solution for the MIDI-B Challenge

**Authors:** Hamideh Haghiri, Rajesh Baidya, Stefan Dvoretskii, Klaus H. Maier-Hein, Marco Nolden

arXiv: 2509.00437 · 2025-09-03

## TL;DR

This paper introduces a hybrid AI and rule-based framework for DICOM de-identification that combines PaddleOCR, a fine-tuned RoBERTa model, and rule-based methods, reaching 99.91% accuracy on the MIDI-B test dataset.

## Contribution

The paper presents a hybrid approach to DICOM de-identification that handles structured data with rule-based methods and applies a transformer model only to free text, with the aim of supporting safe sharing of medical imaging data.

## Key findings

- Achieved 99.91% de-identification accuracy on the MIDI-B test dataset.
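- Combined rule-based methods with a fine-tuned transformer model (RoBERTa) to remove PII and PHI.
- Refined the approach by restricting the transformer model to free text and leaving structured data to rule-based methods, which improved performance.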

## Abstract

Ensuring the de-identification of medical imaging data is a critical step in enabling safe data sharing. This paper presents a hybrid de-identification framework designed to process Digital Imaging and Communications in Medicine (DICOM) files. Our framework adopts a modified, pre-built rule-based component, updated with The Cancer Imaging Archive (TCIA)'s best practices guidelines, as outlined in DICOM PS 3.15, for improved performance. It incorporates PaddleOCR, a robust Optical Character Recognition (OCR) system for extracting text from images, and RoBERTa, a fine-tuned transformer-based model for identifying and removing Personally Identifiable Information (PII) and Protected Health Information (PHI). Initially, the transformer-based model and the rule-based component were integrated to process both structured data and free text. However, this coarse-grained approach did not yield optimal results. To improve performance, we refined our approach by applying the transformer model exclusively to free text, while structured data was handled only by rule-based methods. In this framework, the DICOM validator dciodvfy was leveraged to ensure the integrity of DICOM files after the de-identification process. Through iterative refinement, including the incorporation of custom rules and private tag handling, the framework achieved a de-identification accuracy of 99.91% on the MIDI-B test dataset. The results demonstrate the effectiveness of combining rule-based compliance with AI-enabled adaptability in addressing the complex challenges of DICOM de-identification.
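
The routing described in the abstract (rules for structured data, a model only for free text) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `Element` class, the `RULES` table, the `FREE_TEXT_VRS` set, the `"ANON"` placeholder, and the policy of dropping unlisted non-text elements. The `regex_scrub` function is a plain regex stand-in for the fine-tuned RoBERTa model, not a reimplementation of it.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: value representations (VRs) treated as free text in this sketch.
FREE_TEXT_VRS = {"LO", "LT", "SH", "ST", "UT"}

# Hypothetical rule table (tag -> action); not the paper's actual rule set.
RULES: Dict[str, str] = {
    "0010,0010": "remove",   # PatientName
    "0010,0020": "replace",  # PatientID
    "0008,0060": "keep",     # Modality
}


@dataclass
class Element:
    tag: str    # "gggg,eeee" group/element pair
    vr: str     # DICOM value representation
    value: str


def regex_scrub(text: str) -> str:
    """Stand-in for the transformer model: masks two simple patterns."""
    text = re.sub(r"\bDr\.\s+[A-Z][a-z]+", "[NAME]", text)
    return re.sub(r"\b\d{4}-\d{2}-\d{2}\b", "[DATE]", text)


def deidentify(
    elements: List[Element],
    rules: Dict[str, str] = RULES,
    scrub: Callable[[str], str] = regex_scrub,
) -> List[Element]:
    """Apply explicit rules first; send only unlisted free-text elements to the scrubber."""
    out: List[Element] = []
    for el in elements:
        action = rules.get(el.tag)
        if action == "remove":
            continue
        if action == "replace":
            out.append(Element(el.tag, el.vr, "ANON"))
        elif action == "keep":
            out.append(el)
        elif el.vr in FREE_TEXT_VRS:
            # Free text with no explicit rule: the model (here, a regex) decides.
            out.append(Element(el.tag, el.vr, scrub(el.value)))
        # Assumption: unlisted, non-text elements are dropped.
    return out
```

Checking the rule table before the VR keeps identifiers such as PatientID (VR `LO`) under deterministic rules even though their VR could also hold free text. Passing a different `scrub` callable is where a real named-entity model would plug in.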

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00437/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/2509.00437/full.md

## References

19 references — full list in the complete paper: https://tomesphere.com/paper/2509.00437/full.md

---
Source: https://tomesphere.com/paper/2509.00437