# Multimodal identification of a rare head and neck cancer patient cohort in the clinical data warehouse of Greater Paris Teaching Hospital

**Authors:** A. La Rosa, M. Verdoux, P. Riebler, I. Lolli, C. Daniel, X. Tannier, S. Atallah, B. Baujat, E. Kempf

PMC · DOI: 10.1016/j.esmorw.2025.100151 · ESMO Real World Data and Digital Oncology · 2025-05-29

## TL;DR

The study developed a multimodal method to identify patients with rare head and neck cancers (HNCs) in a clinical data warehouse by combining structured codes (ICD-10, ADICAP) with natural language processing of free-text pathology reports.

## Contribution

A novel multimodal algorithm integrating ICD-10 codes, ADICAP codes, and NLP of free-text pathology reports to identify rare HNC patients in a clinical data warehouse.
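As a rough illustration of the multi-source idea, the Python sketch below flags a patient as rare HNC when enough sources agree. The `PatientEvidence` class, its field names, and the `min_sources` parameter are hypothetical and not taken from the paper. The per-source flags are assumed to be computed upstream, and the patients are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientEvidence:
    """Hypothetical per-patient flags, one per data source."""
    patient_id: str
    icd10_rare: bool   # rare HNC suggested by ICD-10 codes
    adicap_rare: bool  # rare HNC suggested by ADICAP codes
    nlp_rare: bool     # rare HNC suggested by NLP on pathology reports

    def n_sources(self) -> int:
        # Booleans sum as 0/1, giving the number of agreeing sources.
        return sum((self.icd10_rare, self.adicap_rare, self.nlp_rare))


def classify(patients, min_sources=1):
    """Return IDs of patients flagged by at least `min_sources` sources."""
    return [p.patient_id for p in patients if p.n_sources() >= min_sources]


# Invented toy patients, for illustration only.
patients = [
    PatientEvidence("P1", True, True, False),
    PatientEvidence("P2", False, False, True),
    PatientEvidence("P3", False, False, False),
    PatientEvidence("P4", True, True, True),
]

any_source = classify(patients, min_sources=1)    # ICD-10 or ADICAP or NLP
multi_source = classify(patients, min_sources=2)  # at least two sources
```

With `min_sources=1` the rule corresponds to the broad "ICD-10 or ADICAP or NLP" cohort, and with `min_sources=2` to the stricter subset identified by at least two data sources.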

## Key findings

- 4515 patients were classified as rare HNC using ICD-10, ADICAP, or NLP.
- 2168 patients were identified by at least two data sources; when relying on multiple data sources, the algorithm reached 91% sensitivity and 95% specificity.
- NLP showed high sensitivity but had a 9% false positive rate.
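For readers less familiar with these metrics, the sketch below shows how sensitivity, specificity, and positive predictive value follow from a confusion matrix. The function name `diagnostic_metrics` is an assumption made for this example, and the counts are invented. They are not the study's validation data.

```python
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard diagnostic metrics from confusion-matrix counts."""
    if min(tp + fn, tn + fp, tp + fp) == 0:
        raise ValueError("each metric needs a non-zero denominator")
    return {
        # Share of true rare HNC cases that the algorithm flags.
        "sensitivity": tp / (tp + fn),
        # Share of non-rare cases that the algorithm correctly leaves out.
        "specificity": tn / (tn + fp),
        # Share of flagged cases that are truly rare HNC.
        "ppv": tp / (tp + fp),
    }


# Invented counts, for illustration only.
metrics = diagnostic_metrics(tp=8, fp=1, tn=9, fn=2)
```

Positive predictive value depends on how common the condition is among the evaluated cases, which is one reason it can differ between subgroups even when sensitivity and specificity look similar.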

## Abstract

Ten percent of head and neck cancers (HNCs) differ from the common upper aerodigestive tract squamous-cell carcinoma. These HNCs can be rare because of their histology or their anatomical location. The federation of clinical data warehouses (CDWs) holds potential for advancing our understanding of these pathologies. This study aimed to develop a multimodal algorithm to identify rare HNC patients in a CDW.

We carried out a cross-sectional study on the CDW of a conglomerate of 38 university hospitals. We developed a multimodal classification algorithm to identify rare HNC patients by integrating International Classification of Diseases, 10th revision (ICD-10) codes, Association for the Development of Computer Science in Cytology and Pathological Anatomy (ADICAP) codes and free-text data from pathology reports using natural language processing (NLP). Algorithm performance was evaluated by an HNC medical expert using a validation set of 100 manually annotated cases.

Of 333 852 cancer patients, 9141 were identified as HNC patients based on ICD-10 and ADICAP codes. The multimodal algorithm using ICD-10 or ADICAP codes or NLP-processed free text classified 4515 patients as rare HNC patients, with 2168 identified by a minimum of two data sources. It showed a 91% sensitivity and a 95% specificity when relying on multiple data sources, with a 76% positive predictive value observed for rare histology identification compared with 43% for rare topography.
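To put the reported counts in proportion, the short Python sketch below derives two shares directly from the figures in the paragraph above. The variable names are chosen for this example only, and no counts beyond those reported are used.

```python
# Counts as reported in the abstract.
cancer_patients = 333_852   # all cancer patients in the CDW
hnc_patients = 9_141        # HNC patients by ICD-10 and ADICAP codes
rare_any_source = 4_515     # rare HNC by ICD-10 or ADICAP or NLP
rare_multi_source = 2_168   # rare HNC by at least two data sources

# Share of cancer patients identified as HNC patients, in percent.
hnc_share = round(100 * hnc_patients / cancer_patients, 1)

# Share of the rare HNC cohort confirmed by at least two sources, in percent.
multi_source_share = round(100 * rare_multi_source / rare_any_source, 1)

print(f"HNC among cancer patients: {hnc_share}%")
print(f"Rare HNC confirmed by two or more sources: {multi_source_share}%")
```

About 2.7% of the cancer patients were identified as HNC patients, and about 48% of the patients classified as rare HNC were supported by at least two data sources.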

This study demonstrates the feasibility and utility of a multimodal electronic health record-based approach to identify rare HNC patients in a CDW. Incorporating free-text and structured data improves the reliability of such cohort identification.

- Three data sources were used to identify rare HNC patients in a CDW.
- A total of 4515 cancer patients were classified as having rare HNC via ICD-10, ADICAP or NLP.
- A subset of 2168 patients was identified by at least two data sources.
- When relying on multiple data sources, 91% sensitivity and 95% specificity were reached.
- NLP performed well in terms of sensitivity, but showed a 9% rate of false positives.

## Full-text entities

- **Diseases:** upper aerodigestive tract squamous-cell carcinoma (MESH:D002294), HNCs (MESH:D006258), cancer (MESH:D009369)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12836562/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12836562/full.md

## References

21 references — full list in the complete paper: https://tomesphere.com/paper/PMC12836562/full.md

---
Source: https://tomesphere.com/paper/PMC12836562