TL;DR
This paper introduces ChronoScope, a benchmark for testing how well language models maintain and update temporal context over multiple dialogue turns, revealing significant stability issues.
Contribution
The paper presents ChronoScope, a large-scale diagnostic benchmark of over one million deterministically generated question chains grounded in Wikidata, for evaluating temporal scope stability in multi-turn interactions with language models.
Findings
Models often drift toward present-day assumptions even when they hold the correct time-scoped knowledge.
Temporal stability issues increase with longer interactions.
Failures persist even with oracle context, indicating a gap in temporal reasoning.
Abstract
Language models are increasingly deployed in interactive settings where users reason about facts over time rather than in isolation. In such scenarios, correct behavior requires models to maintain and update implicit temporal assumptions established earlier in a conversation. We study this challenge through the lens of temporal scope stability: the ability to preserve, override, or transfer time-scoped factual context across dialogue turns. We introduce ChronoScope, a large-scale diagnostic benchmark designed to isolate temporal scope behavior in controlled multi-turn interactions, comprising over one million deterministically generated question chains grounded in Wikidata. ChronoScope evaluates whether models can correctly retain inferred temporal scope when follow-up questions omit explicit time references, spanning implicit carryover, explicit scope switching, cross-entity transfer,…
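To make the notion of temporal scope stability concrete, here is a minimal Python sketch of one question chain that exercises implicit carryover, cross-entity transfer, and explicit scope switching. It is an illustration only and not ChronoScope's actual data format or generator: the `Fact` and `Turn` classes, the helper functions, and every entity and value in `FACTS` are invented for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str
    value: str
    start: int  # first year the fact holds (inclusive)
    end: int    # last year the fact holds (inclusive)


# Toy time-scoped facts, invented for illustration (not taken from Wikidata).
FACTS = [
    Fact("Avalon", "mayor", "Ada", 2001, 2009),
    Fact("Avalon", "mayor", "Ben", 2010, 2024),
    Fact("Avalon", "treasurer", "Cy", 2001, 2012),
    Fact("Avalon", "treasurer", "Dee", 2013, 2024),
    Fact("Brigadoon", "mayor", "Eve", 2000, 2007),
    Fact("Brigadoon", "mayor", "Flo", 2008, 2024),
]


@dataclass
class Turn:
    subject: str
    relation: str
    year: Optional[int] = None  # None: the question omits its time reference


def lookup(subject: str, relation: str, year: int) -> str:
    """Return the value that holds for (subject, relation) in the given year."""
    for f in FACTS:
        if (f.subject, f.relation) == (subject, relation) and f.start <= year <= f.end:
            return f.value
    raise KeyError((subject, relation, year))


def scope_stable_answers(chain: List[Turn]) -> List[str]:
    """Answer each turn under the most recently stated year (stable scope)."""
    scope: Optional[int] = None
    answers = []
    for turn in chain:
        if turn.year is not None:
            scope = turn.year  # an explicit year overrides the current scope
        if scope is None:
            raise ValueError("chain must state a year before omitting one")
        answers.append(lookup(turn.subject, turn.relation, scope))
    return answers


def present_day_answers(chain: List[Turn], now: int = 2024) -> List[str]:
    """Drifting baseline: falls back to `now` whenever a turn omits the year."""
    return [
        lookup(t.subject, t.relation, t.year if t.year is not None else now)
        for t in chain
    ]


chain = [
    Turn("Avalon", "mayor", 2005),   # establishes the scope: 2005
    Turn("Avalon", "treasurer"),     # implicit carryover of 2005
    Turn("Brigadoon", "mayor"),      # cross-entity transfer of 2005
    Turn("Avalon", "mayor", 2015),   # explicit scope switch to 2015
    Turn("Avalon", "treasurer"),     # implicit carryover of 2015
]

stable = scope_stable_answers(chain)
drifted = present_day_answers(chain)
print(stable)   # ['Ada', 'Cy', 'Eve', 'Ben', 'Dee']
print(drifted)  # ['Ada', 'Dee', 'Flo', 'Ben', 'Dee']
```

In this toy setup, the two answer lists differ exactly on the turns that omit the year while the established scope is not the present, which is the kind of present-day drift the findings describe.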
