Improving Social Determinants of Health Documentation in French EHRs Using Large Language Models

Adrien Bazoge; Pacôme Constant dit Beaufils; Mohammed Hmitouch; Romain Bourcier; Emmanuel Morin; Richard Dufour; Béatrice Daille; Pierre-Antoine Gourraud; Matilde Karakachoff

arXiv:2507.03433·cs.CL·July 8, 2025

TL;DR

This study demonstrates that large language models can effectively extract social determinants of health from French clinical notes, significantly improving data completeness compared to traditional structured EHR coding.

Contribution

The paper introduces an LLM-based approach, a Flan-T5-Large model trained on annotated social history sections, for extracting 13 SDoH categories from French clinical notes. Two of the four evaluation datasets are publicly released as open resources.

Findings

01. Strong performance on well-documented SDoH categories (F1 > 0.80)

02. The model identified 95.8% of patients with at least one SDoH

03. Performance was limited by annotation inconsistencies and language-specific issues

Abstract

Social determinants of health (SDoH) significantly influence health outcomes, shaping disease progression, treatment adherence, and health disparities. However, their documentation in structured electronic health records (EHRs) is often incomplete or missing. This study presents an approach based on large language models (LLMs) for extracting 13 SDoH categories from French clinical notes. We trained Flan-T5-Large on annotated social history sections from clinical notes at Nantes University Hospital, France. We evaluated the model at two levels: (i) identification of SDoH categories and associated values, and (ii) extraction of detailed SDoH with associated temporal and quantitative information. The model performance was assessed across four datasets, including two that we publicly release as open resources. The model achieved strong performance for identifying well-documented categories…
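
To make the first evaluation level concrete (identification of SDoH categories and associated values), a per-category F1 score could be computed roughly as below. This is a minimal sketch and not the authors' evaluation code: the function name `per_category_f1`, the representation of each note as a set of (category, value) pairs, and the example labels are all assumptions made for illustration.

```python
from collections import defaultdict


def per_category_f1(gold, pred):
    """Compute an F1 score per SDoH category.

    gold, pred: lists of equal length, one entry per note; each entry is
    a set of (category, value) pairs. A predicted pair counts as correct
    only if both the category and the value match the gold annotation.
    """
    tp = defaultdict(int)  # true positives per category
    fp = defaultdict(int)  # false positives per category
    fn = defaultdict(int)  # false negatives per category

    for gold_pairs, pred_pairs in zip(gold, pred):
        for category, _ in gold_pairs & pred_pairs:
            tp[category] += 1
        for category, _ in pred_pairs - gold_pairs:
            fp[category] += 1
        for category, _ in gold_pairs - pred_pairs:
            fn[category] += 1

    scores = {}
    for category in set(tp) | set(fp) | set(fn):
        denom = 2 * tp[category] + fp[category] + fn[category]
        scores[category] = 2 * tp[category] / denom if denom else 0.0
    return scores


# Hypothetical labels, for illustration only.
gold = [
    {("tobacco", "current"), ("marital_status", "married")},
    {("tobacco", "none")},
]
pred = [
    {("tobacco", "current"), ("marital_status", "single")},
    {("tobacco", "none"), ("alcohol", "current")},
]
scores = per_category_f1(gold, pred)
```

In this toy example the tobacco pairs all match, while the wrong marital status value and the spurious alcohol prediction each count as errors for their category. Scoring per category rather than pooling all pairs is what makes it possible to separate well-documented categories from sparsely documented ones.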
