# Artificial Intelligence Models for Predicting Triage in Emergency Departments: Seven-Month Retrospective Comparative Study of Natural Language Processing, Large Language Model, and Joint Embedding Predictive Architectures

**Authors:** Edouard Lansiaux, Ramy Azzouz, Emmanuel Chazard, Amélie Vromant, Eric Wiel

PMC · DOI: 10.2196/83318 · 2026-03-10

## TL;DR

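This single-center retrospective study compares three AI models (NLP, LLM, and JEPA) for predicting emergency department triage on the FRENCH scale. The LLM-based model, URGENTIAPARSE, performed best (F1-score 0.900), but severe overfitting and selection bias (only 657 of 73,236 visits included) limit its clinical applicability.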

## Contribution

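The study develops and compares three AI models for triage prediction (TRIAGEMASTER, URGENTIAPARSE, and EMERGINET), benchmarks them against nurse triage and a senior clinician consensus, and highlights both the potential and the challenges of LLMs in clinical settings.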

## Key findings

- URGENTIAPARSE, an LLM-based model, outperformed other AI models and nurse triage in predicting triage levels.
- URGENTIAPARSE reached an F1-score of 0.900 and an AUC-ROC of 0.879, but training accuracy was perfect (1.0) while validation performance was about 0.5, indicating overfitting.
- The study highlights the need for external validation and bias mitigation before clinical deployment.

## Abstract

Triage errors in emergency departments (EDs), including undertriage and overtriage, pose significant risks to patient safety and resource allocation. With increasing patient volumes and staffing challenges, artificial intelligence (AI) integration into triage protocols has gained attention as a potential solution.

This study aims to develop and compare 3 AI models—natural language processing (NLP), large language model (LLM), and Joint Embedding Predictive Architecture (JEPA)—for predicting triage outcomes according to the French Emergency Nurses Classification in Hospital (FRENCH) scale and to assess their performance relative to nurse triage and clinical expert consensus.

We conducted a retrospective analysis of prospectively collected data from adult patients triaged at Roger Salengro Hospital ED (Lille, France) over 7 months (June-December 2024). Three AI models were developed: TRIAGEMASTER (NLP with Doc2Vec + MLP), URGENTIAPARSE (LLM with FlauBERT + Extreme Gradient Boosting [XGBoost]), and EMERGINET (JEPA with variance-invariance-covariance regularization). Of 73,236 ED visits, 657 (0.90%) had complete audio recordings and structured data. Data were split 80:20 into training and validation sets with stratification. Gold-standard labels were established by consensus of senior clinicians (minimum 5 years of ED experience). The primary outcome was concordance with the gold-standard FRENCH triage level, assessed using weighted κ, Spearman correlation, F1-score, area under the receiver operating characteristic curve (AUC-ROC), mean absolute error (MAE), and root mean square error (RMSE). Secondary analyses evaluated Groupes d’Etude Multicentrique des Services d’Accueil (GEMSA) prediction and performance by input data type.
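A stratified 80:20 split of the kind described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `stratified_split`, the integer coding of triage levels, the per-class rounding rule, and every class count except the 4 class 1 cases are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) so each class keeps roughly its share in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for label, indices in sorted(by_class.items()):
        rng.shuffle(indices)
        n_val = round(len(indices) * test_fraction)
        # Assumption: keep at least one case on each side when a class has >= 2 cases.
        if len(indices) >= 2:
            n_val = min(max(n_val, 1), len(indices) - 1)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Hypothetical label distribution summing to 657 visits.
# Only the class 1 count (4 of 657) is taken from the abstract.
labels = [1] * 4 + [2] * 103 + [3] * 250 + [4] * 200 + [5] * 100
train_idx, val_idx = stratified_split(labels)

class1_in_val = sum(1 for i in val_idx if labels[i] == 1)
print(len(train_idx), len(val_idx), class1_in_val)
```

With only 4 class 1 cases, an 80:20 split leaves about one such case in the validation set, so any validation estimate for the highest-acuity class rests on very little data.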

URGENTIAPARSE demonstrated superior performance, with a composite z score of 2.514, compared with 0.438 for EMERGINET, –3.511 for TRIAGEMASTER, and –4.343 for nurse triage. URGENTIAPARSE achieved an F1-score of 0.900 (95% CI 0.876-0.924), an AUC-ROC of 0.879 (95% CI 0.851-0.907), a weighted κ of 0.800 (P<.001), a Spearman correlation of 0.802 (P<.001), an MAE of 0.228, and an RMSE of 0.790. Exact agreement was 90.0%, with near-agreement (within ±1 level) of 92.8%. However, training accuracy was perfect (1.0) while validation performance was poor (~0.5), indicating overfitting. EMERGINET achieved moderate performance (F1-score=0.731, AUC-ROC=0.686), while TRIAGEMASTER and nurse triage performed poorly (F1-score=0.618 and 0.303, respectively). For GEMSA prediction, URGENTIAPARSE maintained superiority (κ=0.863, Spearman=0.864, P<.001). Class 1 (highest acuity) was underrepresented (4/657, 0.61%), limiting undertriage risk assessment.
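For readers unfamiliar with the ordinal agreement measures reported above, the sketch below computes MAE, RMSE, exact agreement, agreement within one level, and a weighted κ on toy data. Several choices are assumptions rather than details from the abstract: the weighting scheme is quadratic (the abstract does not say whether linear or quadratic weights were used), the within-one-level figure is taken to include exact matches, triage levels are coded as consecutive integers, and the function name `agreement_metrics` and the toy labels are invented for illustration.

```python
import math


def agreement_metrics(gold, pred, levels):
    """Ordinal agreement between integer-coded gold and predicted triage levels."""
    n = len(gold)
    diffs = [p - g for g, p in zip(gold, pred)]
    mae = sum(abs(d) for d in diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    exact = sum(d == 0 for d in diffs) / n
    within_one = sum(abs(d) <= 1 for d in diffs) / n  # assumption: includes exact matches

    # Quadratic weighted kappa: 1 - sum(w * observed) / sum(w * expected),
    # where expected is the product of the two marginal distributions.
    k = len(levels)
    pos = {lvl: i for i, lvl in enumerate(levels)}
    observed = [[0.0] * k for _ in range(k)]
    for g, p in zip(gold, pred):
        observed[pos[g]][pos[p]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * observed[i][j]
            den += w * row[i] * col[j]
    kappa = 1 - num / den if den > 0 else float("nan")

    return {"mae": mae, "rmse": rmse, "exact": exact,
            "within_one": within_one, "kappa": kappa}


# Toy data, not from the study.
levels = [1, 2, 3, 4, 5]
gold = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
pred = [1, 2, 3, 3, 3, 3, 4, 5, 5, 3]
metrics = agreement_metrics(gold, pred, levels)
print(metrics)
```

Because the weights grow with the squared distance between levels, a prediction two levels off lowers κ more than two separate one-level errors, which is why weighted κ is usually preferred over plain accuracy for ordinal triage scales.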

The LLM-based architecture (URGENTIAPARSE) demonstrated the highest accuracy for ED triage prediction among the tested models, outperforming traditional NLP, JEPA, and current nurse triage practices. However, severe overfitting, extreme selection bias (657/73,236, 0.90%, inclusion), a monocentric design, and sparse high-acuity representation limit clinical applicability. Before deployment, the model requires regularization, external validation across diverse EDs, prospective testing, and comprehensive safety evaluation, particularly for undertriage detection. Integration of AI triage support systems shows promise but demands rigorous validation, bias mitigation, and transparent uncertainty quantification to ensure patient safety.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

13 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13014074/full.md

---
Source: https://tomesphere.com/paper/PMC13014074