Memory-T1: Reinforcement Learning for Temporal Reasoning in Multi-session Agents

Yiming Du; Baojun Wang; Yifan Xiang; Zhaowei Wang; Wenyu Huang; Boyang Xue; Bin Liang; Xingshan Zeng; Fei Mi; Haoli Bai; Lifeng Shang; Jeff Z. Pan; Yuxin Jiang; Kam-Fai Wong

arXiv:2512.20092 · cs.CL · December 24, 2025

Open Access · 3 Reviews

TL;DR

Memory-T1 introduces a reinforcement learning framework with a temporal consistency reward to improve long-term temporal reasoning in multi-session dialogue agents, significantly enhancing performance and robustness.

Contribution

It presents a novel RL-based memory selection policy with a multi-level reward, including a temporal consistency component, for better reasoning over extended dialogue histories.

Findings

01. Achieves 67.0% on the Time-Dialog benchmark, surpassing previous models.

02. Outperforms a 14B baseline by 10.2% in overall score.

03. Maintains robustness up to 128k tokens, where others fail.

Abstract

Temporal reasoning over long, multi-session dialogues is a critical capability for conversational agents. However, existing works and our pilot study have shown that as dialogue histories grow in length and accumulate noise, current long-context models struggle to accurately identify temporally pertinent information, significantly impairing reasoning performance. To address this, we introduce Memory-T1, a framework that learns a time-aware memory selection policy using reinforcement learning (RL). It employs a coarse-to-fine strategy, first pruning the dialogue history into a candidate set using temporal and relevance filters, followed by an RL agent that selects the precise evidence sessions. The RL training is guided by a multi-level reward function optimizing (i) answer accuracy, (ii) evidence grounding, and (iii) temporal consistency. In particular, the temporal consistency reward…
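The coarse-to-fine pruning and the three-part reward described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Session` type, the function names, the token-overlap relevance score, the `top_k` value, and the weighted-sum form of the reward with placeholder weights are all assumptions made for illustration. The RL selection step itself is not modeled here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Session:
    """Hypothetical container for one dialogue session."""
    session_id: str
    day: date
    text: str


def temporal_filter(sessions, start, end):
    """Coarse step 1: keep sessions dated inside the query's time window."""
    return [s for s in sessions if start <= s.day <= end]


def lexical_score(query, text):
    """Toy relevance score (assumed): fraction of query tokens found in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def prune_candidates(sessions, query, start, end, top_k=2):
    """Coarse step 2: rank the time-filtered sessions by relevance, keep top_k."""
    in_window = temporal_filter(sessions, start, end)
    ranked = sorted(in_window, key=lambda s: lexical_score(query, s.text), reverse=True)
    return ranked[:top_k]


def combined_reward(r_answer, r_grounding, r_temporal, weights=(1.0, 1.0, 1.0)):
    """Assumed weighted sum of the three reward terms; weights are placeholders."""
    w_a, w_g, w_t = weights
    return w_a * r_answer + w_g * r_grounding + w_t * r_temporal


# Made-up dialogue history, for illustration only.
history = [
    Session("s1", date(2024, 1, 5), "We booked the flight to Paris"),
    Session("s2", date(2024, 3, 2), "The Paris trip got moved to April"),
    Session("s3", date(2023, 6, 9), "Paris flight prices were high last year"),
    Session("s4", date(2024, 2, 1), "I adopted a cat"),
]

candidates = prune_candidates(
    history, "Paris flight", date(2024, 1, 1), date(2024, 12, 31), top_k=2
)
print([s.session_id for s in candidates])
```

In this sketch the time window drops `s3` before any relevance scoring happens, so a lexically strong but temporally irrelevant session never reaches the candidate set. In the framework described above, a learned policy would then pick the evidence sessions from such candidates, with a reward like `combined_reward` guiding training.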

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. Timely and important: Tackles temporal reasoning over long multi-session dialogues, an increasingly central capability for agentic systems and memory architectures.
2. Well-motivated reward design for GRPO: Clear decomposition into Ra/Rg/Rt; Rt thoughtfully mixes proximity and fidelity with soft penalties and explicit hyper-parameters; weights and sensitivity are reported.
3. Coarse-to-fine retrieval is executed cleanly with a precise time filter then lexical ranking; the analysis of top-k a

Weaknesses

(W1) Table 2 interpretation of Rt/Rs/Rf effects: The paper states that removing Rs or Rf yields a trade-off aligned with task difficulty (Category-A “simpler” tasks improve, while B/C “complex” tasks degrade). However, the full removal of Rt (both Rs and Rf) does not show a monotone extension of the same trend. If Category-A improvements under −Rs/−Rf are due to “easier” temporal structure, why does removing the entire Rt lower Category-A below the full model instead of amplifying that benefit? T

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The paper targets an important and underexplored aspect of dialogue modeling, namely temporal reasoning across multiple sessions. The proposed coarse-to-fine retrieval framework is intuitive and improves retrieval efficiency under long histories. Integrating reinforcement learning provides a structured way to optimize multiple supervision signals jointly. The experimental results demonstrate strong improvements on established benchmarks, and the ablation studies highlight the contribution of eac

Weaknesses

The methodological novelty is moderate. The framework largely combines existing reinforcement learning and retrieval techniques with additional temporal supervision. The temporal consistency reward is well-motivated but not particularly new, and its formulation appears heuristic. The paper does not provide enough justification for the chosen reward weights or a systematic analysis of their sensitivity. The ablation results are limited and do not clearly establish whether improvements stem from t

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The coarse-to-fine retrieval strategy combined with multi-level RL rewards is well-justified. The temporal consistency reward design enhanced the temporal reasoning and evidence selection of the model to better predict the answer.
- Ablation studies (Table 2) provide clear evidence for each reward component's contribution, demonstrating the importance of individual design choices.

Weaknesses

- The primary evaluation is conducted on the in-domain Time-Dialog dataset, where Memory-T1 is trained in-domain while other baselines (Time-R1, MemAgent) are evaluated in a zero-shot setting. This makes direct comparison difficult. It remains unclear how Memory-T1 would perform against these baselines on unseen benchmarks such as LoCoMo, particularly in a comparable experimental setup.
- The proposed framework requires extensive temporal annotations for training, yet the sensitivity to annotati

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Multimodal Machine Learning Applications