Assessing and Mitigating Medical Knowledge Drift and Conflicts in Large Language Models

Weiyi Wu; Xinwen Xu; Chongyang Gao; Xingjian Diao; Siting Li; Lucas A. Salas; Jiang Gui

arXiv:2505.07968·cs.CL·September 9, 2025

TL;DR

This paper evaluates how large language models in healthcare handle evolving medical knowledge, revealing challenges in maintaining accuracy over time and proposing mitigation strategies to improve their reliability.

Contribution

Introduces the DriftMedQA benchmark to simulate medical guideline evolution and assesses mitigation strategies like retrieval augmentation and preference fine-tuning.

Findings

1. Models struggle with outdated and conflicting medical advice.
2. Mitigation strategies improve model reliability.
3. Combined methods yield the most consistent results.

Abstract

Large Language Models (LLMs) have considerable potential in health care, yet they struggle to keep pace with rapidly evolving medical knowledge, which can lead to outdated or contradictory treatment suggestions. This study investigated how LLMs respond to evolving clinical guidelines, focusing on concept drift and internal inconsistencies. We developed the DriftMedQA benchmark to simulate guideline evolution and assessed the temporal reliability of various LLMs. Our evaluation of seven state-of-the-art models across 4,290 scenarios showed that the models had difficulty rejecting outdated recommendations and frequently endorsed conflicting guidance. Additionally, we explored two mitigation strategies: Retrieval-Augmented Generation and preference fine-tuning via Direct Preference Optimization. While each method improved model performance, their combination led to the most…
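
The abstract names Direct Preference Optimization (DPO) as one of the two mitigation strategies. As a rough illustration of what that objective computes, the minimal sketch below evaluates the standard DPO loss for a single preference pair. It is not the authors' training code: the function name, the default `beta`, and the framing of the preferred answer as the current guideline and the rejected answer as the outdated one are assumptions made for illustration only.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of answers.

    Inputs are summed token log-probabilities of each answer under the
    policy being tuned and under the frozen reference model.
    Hypothetical framing: 'chosen' is the answer that follows the current
    guideline, 'rejected' is the answer that follows the outdated one.
    """
    # Implicit reward margin: how much more the policy (relative to the
    # reference) prefers the chosen answer over the rejected answer.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the margin is zero.
neutral = dpo_pair_loss(-12.0, -11.0, -12.0, -11.0)

# Policy has shifted probability toward the chosen answer: loss is lower.
improved = dpo_pair_loss(-10.0, -13.0, -12.0, -11.0)

print(neutral, improved)
```

The loss shrinks as the policy, relative to the reference model, assigns more probability to the preferred answer than to the rejected one; `beta` controls how strongly the policy is allowed to move away from the reference.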

Code & Models

Datasets

RDBH/DriftMed
dataset · 26 dl

Videos

Assessing and Mitigating Medical Knowledge Drift and Conflicts in Large Language Models · underline