Temporal Fact Conflicts in LLMs: Reproducibility Insights from Unifying DYNAMICQA and MULAN
Ritajit Dey, Iadh Ounis, Graham McDonald, Yashar Moshfeghi

TL;DR
This reproducibility study compares DYNAMICQA and MULAN on whether external context resolves temporal fact conflicts in LLMs, showing that the conclusions depend on the dataset, the model size, and the evaluation design.
Contribution
It reproduces and compares experiments from DYNAMICQA and MULAN, analyzing dataset effects and model size influence on temporal fact updating in LLMs.
Findings
MULAN's findings generalize across frameworks
Dataset design significantly impacts results
Model size affects temporal fact encoding and updating
Abstract
Large Language Models (LLMs) often struggle with temporal fact conflicts due to outdated or evolving information in their training data. Two recent studies with accompanying datasets report opposite conclusions on whether external context can effectively resolve such conflicts. DYNAMICQA evaluates how effective external context is in shifting the model's output distribution, finding that temporal facts are more resistant to change. In contrast, MULAN examines how often external context changes memorised facts, concluding that temporal facts are easier to update. In this reproducibility paper, we first reproduce experiments from both benchmarks. We then reproduce the experiments of each study on the dataset of the other to investigate the source of their disagreement. To enable direct comparison of findings, we standardise both datasets to align with the evaluation settings of each…
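The abstract contrasts an evaluation based on shifts in the output distribution with one based on how often a memorised fact is changed. The minimal sketch below shows why two such designs can rank the same behaviour differently. Both functions are toy stand-ins chosen for illustration: total variation distance and a top-answer flip rate are assumptions, not the metrics defined in DYNAMICQA or MULAN, and all names and numbers are hypothetical.

```python
from typing import Dict, List, Tuple

Distribution = Dict[str, float]


def total_variation(p: Distribution, q: Distribution) -> float:
    """Toy distribution-shift measure: total variation distance between
    the answer distribution without context (p) and with context (q)."""
    answers = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in answers)


def top_answer(p: Distribution) -> str:
    """Most probable answer under a distribution."""
    return max(p, key=p.get)


def update_rate(pairs: List[Tuple[Distribution, Distribution, str]]) -> float:
    """Toy update-frequency measure: fraction of examples whose top answer
    differs from the context answer before context is given and equals it
    afterwards."""
    if not pairs:
        return 0.0
    updated = 0
    for before, after, context_answer in pairs:
        if top_answer(before) != context_answer and top_answer(after) == context_answer:
            updated += 1
    return updated / len(pairs)


# Hypothetical example: a near-tie before context, flipped by context.
before = {"old_answer": 0.51, "new_answer": 0.49}
after = {"old_answer": 0.49, "new_answer": 0.51}

shift = total_variation(before, after)                   # small shift
rate = update_rate([(before, after, "new_answer")])      # counted as an update
```

In this hypothetical case the distribution barely moves while the top answer flips, so a shift-based measure would call the fact resistant and a flip-based measure would call it updated. This is only one possible way the two evaluation designs could diverge.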
Taxonomy
Topics: Topic Modeling · Language and cultural evolution · Computational and Text Analysis Methods
