# Transforming a clinical study database into a structured database adapted to artificial intelligence applications

**Authors:** Thibault Sauron, Carole Lazarus, Camille Kurtz, Florence Cloppet, Isabelle Thomassin Naggara, Edouard Poncelet, Aurelie Jalaguier-Coudray, Ingrid Millet, Valerie Juhan, Corinne Balleyguier, Caroline Malhaire, Nicolas Perrot, Marc Bazot, Patrice Taourel, Emile Darai, Laure Fournier

PMC · DOI: 10.1186/s13244-025-02087-2 · 2026-02-16

## TL;DR

This paper presents a method to convert clinical trial MRI data into a structured database suitable for training AI models.

## Contribution

The paper introduces a novel curation methodology and open-source tools for adapting clinical trial data for AI applications.

## Key findings

- A curation process was developed to simplify and harmonize clinical trial MRI data for AI use.
- The directory structure was simplified: the number of files fell by 44% and the number of folders by 95%, and only 62 DICOM fields were retained as necessary for AI applications.
- Quality control and harmonization steps improved data consistency for AI model training.

## Abstract

Medical imaging databases suitable for training machine learning/computer vision algorithms are scarce, limiting the potential for development and generalisation of clinical tools. Clinical trial databases are a source of data, known for their high-quality data and reliable annotations. However, they are not tailored to the needs of machine learning or deep learning models. Our objective was to develop a methodology and tools that enable the curation of these databases specifically for the training or testing of artificial intelligence tools.

MRIs from the French centres of the EURAD clinical trial (MRI of women with pelvic adnexal lesions) were used to constitute the database. We developed the steps required to curate a clinical trial database: definition of inclusion and exclusion criteria, removal of unnecessary data according to the principle of parsimony, quality control, and harmonisation.
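The parsimony step can be illustrated with a minimal sketch that keeps only whitelisted header fields. This is not the authors' tool: the header is modelled as a plain dictionary rather than a real DICOM object, and `KEEP_FIELDS` and `prune_header` are hypothetical names. The whitelist shown is illustrative only, since the fields retained in the study are not listed in this summary.

```python
from typing import Any

# Hypothetical whitelist for illustration; the study's actual retained
# fields are not enumerated in this summary.
KEEP_FIELDS = frozenset({
    "Modality",
    "SeriesDescription",
    "PixelSpacing",
    "SliceThickness",
    "RepetitionTime",
    "EchoTime",
})


def prune_header(header: dict[str, Any],
                 keep: frozenset[str] = KEEP_FIELDS) -> dict[str, Any]:
    """Return a copy of the header that contains only whitelisted fields."""
    return {name: value for name, value in header.items() if name in keep}


# Toy header standing in for the metadata of one MRI file.
raw_header = {
    "PatientName": "DOE^JANE",
    "InstitutionName": "Example Hospital",
    "Modality": "MR",
    "SeriesDescription": "AX T2 FSE",
    "SliceThickness": 4.0,
}

pruned = prune_header(raw_header)
```

A whitelist (keep what is listed) rather than a blacklist (drop what is listed) means that any unexpected field is removed by default, which fits the principle of parsimony described above.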

A total of 713 patients were included in our study. The directory structure was simplified, and the number of files and folders decreased by 44% and 95% respectively. Only 62 DICOM fields were considered necessary for artificial intelligence (AI) model applications. Quality control was implemented in repeated cycles of automatic checks, followed by a final manual random inspection. Finally, sequence names were harmonised for easy identification when developing models.
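Sequence-name harmonisation combined with an automatic check might look like the sketch below. It assumes that raw series descriptions are mapped to canonical labels by ordered regular-expression rules; the rules, the labels, and the function names `harmonise` and `check_names` are hypothetical and not taken from the paper. Names that match no rule are collected for manual review, loosely mirroring the cycle of automatic checks followed by manual inspection.

```python
import re
from typing import Optional

# Hypothetical rules: (pattern, canonical label). The first match wins,
# so more specific patterns are listed before more general ones.
RULES = [
    (re.compile(r"dwi|diff", re.IGNORECASE), "DWI"),
    (re.compile(r"t2", re.IGNORECASE), "T2"),
    (re.compile(r"t1.*(gado|dyn|\+c)|dce", re.IGNORECASE), "T1_CONTRAST"),
    (re.compile(r"t1", re.IGNORECASE), "T1"),
]


def harmonise(series_description: str) -> Optional[str]:
    """Map a raw series description to a canonical label, or None if unknown."""
    for pattern, label in RULES:
        if pattern.search(series_description):
            return label
    return None


def check_names(descriptions: list[str]) -> tuple[dict[str, str], list[str]]:
    """Split descriptions into harmonised names and names needing manual review."""
    harmonised: dict[str, str] = {}
    unresolved: list[str] = []
    for description in descriptions:
        label = harmonise(description)
        if label is None:
            unresolved.append(description)
        else:
            harmonised[description] = label
    return harmonised, unresolved


harmonised, unresolved = check_names(
    ["AX T2 FSE", "ep2d_diff_b1000", "T1 FS GADO", "localizer"]
)
```

In a repeated-cycle workflow, the `unresolved` list would be inspected, new rules added, and the check rerun until no names remain unmatched.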

Using a clinical trial database, we propose a methodology to build a database suitable to train or test AI algorithms. This study underlines the need for a more global and systematic framework for the secondary use of health data to develop AI imaging tools for patient care.

We propose and detail a framework and tools to curate a clinical trial database to allow secondary use of the high-quality annotated data generated in clinical trials for the training and testing of artificial intelligence models.

Clinical trial imaging databases are not adapted for AI model development.

A curation process of MRI databases was developed for machine learning applications.

We share the open-source tools and methodology developed in this study.

## Full-text entities

- **Diseases:** Brain tumor (MESH:D001932), ovarian masses (MESH:D010049), Adnexal torsions (MESH:D000082843), epithelial tumours (MESH:D009375), adnexal lesions (MESH:D000291), Cancer (MESH:D009369), Inflammatory disease (MESH:D007249)
- **Chemicals:** water (MESH:D014867)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12907282/full.md

---
Source: https://tomesphere.com/paper/PMC12907282