Large Language Models Lack Temporal Awareness of Medical Knowledge
Zihan Guan, Qiao Jin, Guangzhi Xiong, Fangyuan Chen, Mengxuan Hu, Qingyu Chen, Yifan Peng, Zhiyong Lu, Anil Vullikanti

TL;DR
This paper introduces TempoMed-Bench, a benchmark for evaluating how well Large Language Models understand and reason about the evolving nature of medical knowledge over time.
Contribution
It presents the first benchmark for assessing temporal awareness in medical LLMs and analyzes their performance and limitations in handling time-specific medical information.
Findings
LLMs' performance declines gradually over time, not abruptly.
Models struggle more with outdated knowledge than current information.
Predictions fluctuate irregularly across neighboring years.
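The first and third findings describe quantities that can be computed from per-question results. The sketch below is illustrative only and is not the paper's evaluation code: it assumes each result is a hypothetical `(year, is_correct)` pair, and the function names `accuracy_by_year` and `mean_adjacent_change` are made up for this example. It computes accuracy per year and the mean absolute change in accuracy between consecutive years, one simple way to see whether performance drifts smoothly or jumps around.

```python
from collections import defaultdict


def accuracy_by_year(records):
    """Return {year: accuracy} from (year, is_correct) pairs.

    The record layout is an assumption made for illustration.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, is_correct in records:
        totals[year] += 1
        hits[year] += int(is_correct)
    return {year: hits[year] / totals[year] for year in sorted(totals)}


def mean_adjacent_change(acc):
    """Mean absolute accuracy change between consecutive calendar years.

    Only pairs of years that differ by exactly one are compared;
    returns 0.0 when no such pair exists.
    """
    years = sorted(acc)
    pairs = [(a, b) for a, b in zip(years, years[1:]) if b - a == 1]
    if not pairs:
        return 0.0
    return sum(abs(acc[b] - acc[a]) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy values invented for the example, not results from the paper.
    toy = [(2019, True), (2019, False), (2020, True),
           (2020, True), (2021, False), (2021, False)]
    acc = accuracy_by_year(toy)
    print(acc)
    print(mean_adjacent_change(acc))
```

A gradual decline would show up as per-year accuracies that trend downward, while a large mean adjacent change relative to that trend would point to irregular year-to-year fluctuation.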
Abstract
Existing methods for evaluating the medical knowledge of Large Language Models (LLMs) are largely based on atemporal, examination-style benchmarks. In reality, medical knowledge is inherently dynamic and evolves continuously as new evidence emerges and new treatments are approved. Consequently, evaluating medical knowledge without temporal context may give an incomplete picture of whether LLMs can reason accurately about time-specific medical knowledge. Moreover, most medical data are historical, so models must not only recall the correct knowledge but also know when that knowledge was correct. To bridge this gap, we built TempoMed-Bench, a first-of-its-kind benchmark for evaluating the temporal awareness of LLMs in the medical domain through evolving guideline knowledge. Based on TempoMed-Bench, our evaluation analysis first reveals that LLMs lack…
