Memory-T1: Reinforcement Learning for Temporal Reasoning in Multi-session Agents
Yiming Du, Baojun Wang, Yifan Xiang, Zhaowei Wang, Wenyu Huang, Boyang Xue, Bin Liang, Xingshan Zeng, Fei Mi, Haoli Bai, Lifeng Shang, Jeff Z. Pan, Yuxin Jiang, Kam-Fai Wong

TL;DR
Memory-T1 introduces a reinforcement learning framework with a temporal consistency reward to improve long-term temporal reasoning in multi-session dialogue agents, significantly enhancing performance and robustness.
Contribution
It presents a novel RL-based memory selection policy with a multi-level reward, including a temporal consistency component, for better reasoning over extended dialogue histories.
Findings
Achieves 67.0% on the Time-Dialog benchmark, surpassing previous models.
Outperforms a 14B baseline by 10.2% in overall score.
Maintains robustness up to 128k tokens, where others fail.
Abstract
Temporal reasoning over long, multi-session dialogues is a critical capability for conversational agents. However, existing works and our pilot study have shown that as dialogue histories grow in length and accumulate noise, current long-context models struggle to accurately identify temporally pertinent information, significantly impairing reasoning performance. To address this, we introduce Memory-T1, a framework that learns a time-aware memory selection policy using reinforcement learning (RL). It employs a coarse-to-fine strategy, first pruning the dialogue history into a candidate set using temporal and relevance filters, followed by an RL agent that selects the precise evidence sessions. The RL training is guided by a multi-level reward function optimizing (i) answer accuracy, (ii) evidence grounding, and (iii) temporal consistency. In particular, the temporal consistency reward…
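The coarse-to-fine idea and the three-part reward described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the `Session` type, the whitespace-token overlap used as a stand-in for a real lexical ranker, the Jaccard-style grounding score, and the reward weights are all hypothetical. The fine-grained RL selection step is not shown.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Session:
    """Hypothetical container for one dialogue session."""
    session_id: str
    day: date
    text: str


def coarse_filter(sessions: list[Session], question: str,
                  start: date, end: date, top_k: int = 3) -> list[Session]:
    """Coarse stage: keep sessions inside [start, end], then rank the
    survivors by token overlap with the question (a crude stand-in for
    a real lexical retriever) and return the top_k candidates."""
    q_tokens = set(question.lower().split())
    in_window = [s for s in sessions if start <= s.day <= end]

    def overlap(s: Session) -> int:
        return len(q_tokens & set(s.text.lower().split()))

    return sorted(in_window, key=overlap, reverse=True)[:top_k]


def grounding_reward(selected: set[str], gold: set[str]) -> float:
    """Assumed grounding score: Jaccard overlap between the selected
    and the annotated evidence session ids."""
    union = selected | gold
    return len(selected & gold) / len(union) if union else 0.0


def combined_reward(r_answer: float, r_grounding: float, r_temporal: float,
                    weights: tuple[float, float, float] = (1.0, 0.5, 0.5)) -> float:
    """Weighted sum of the three reward levels. The default weights are
    placeholders, not values reported in the paper."""
    w_a, w_g, w_t = weights
    return w_a * r_answer + w_g * r_grounding + w_t * r_temporal


if __name__ == "__main__":
    history = [
        Session("s1", date(2024, 1, 5), "we booked the flight to Paris"),
        Session("s2", date(2024, 3, 1), "the flight was delayed"),
        Session("s3", date(2024, 1, 10), "dinner plans"),
    ]
    candidates = coarse_filter(history, "when did we book the flight",
                               date(2024, 1, 1), date(2024, 1, 31), top_k=1)
    print([s.session_id for s in candidates])
    print(combined_reward(1.0, grounding_reward({"s1"}, {"s1", "s2"}), 0.0))
```

In this sketch the temporal filter runs before the lexical ranking, so an out-of-window session such as `s2` is dropped even though it shares words with the question. A learned policy would then choose the final evidence sessions from the returned candidates.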
Peer Reviews
Decision: ICLR 2026 Poster
1. Timely and important: Tackles temporal reasoning over long multi-session dialogues, an increasingly central capability for agentic systems and memory architectures.
2. Well-motivated reward design for GRPO: Clear decomposition into Ra/Rg/Rt; Rt thoughtfully mixes proximity and fidelity with soft penalties and explicit hyper-parameters; weights and sensitivity are reported.
3. Coarse-to-fine retrieval is executed cleanly with a precise time filter then lexical ranking; the analysis of top-k a
(W1) Table 2 interpretation of Rt/Rs/Rf effects. The paper states that removing Rs or Rf yields a trade-off aligned with task difficulty (Category-A “simpler” tasks improve, while B/C “complex” tasks degrade). However, the full removal of Rt (both Rs and Rf) does not show a monotone extension of the same trend. If Category-A improvements under −Rs/−Rf are due to “easier” temporal structure, why does removing the entire Rt lower Category-A below the full model instead of amplifying that benefit? T
The paper targets an important and underexplored aspect of dialogue modeling, namely temporal reasoning across multiple sessions. The proposed coarse-to-fine retrieval framework is intuitive and improves retrieval efficiency under long histories. Integrating reinforcement learning provides a structured way to optimize multiple supervision signals jointly. The experimental results demonstrate strong improvements on established benchmarks, and the ablation studies highlight the contribution of eac
The methodological novelty is moderate. The framework largely combines existing reinforcement learning and retrieval techniques with additional temporal supervision. The temporal consistency reward is well-motivated but not particularly new, and its formulation appears heuristic. The paper does not provide enough justification for the chosen reward weights or a systematic analysis of their sensitivity. The ablation results are limited and do not clearly establish whether improvements stem from t
- The coarse-to-fine retrieval strategy combined with multi-level RL rewards is well-justified. The temporal consistency reward design enhances the model's temporal reasoning and evidence selection, helping it predict the answer.
- Ablation studies (Table 2) provide clear evidence for each reward component's contribution, demonstrating the importance of individual design choices.
- The primary evaluation is conducted on the Time-Dialog dataset, on which Memory-T1 is trained in-domain while the other baselines (Time-R1, MemAgent) are evaluated in a zero-shot setting. This makes direct comparison difficult. It remains unclear how Memory-T1 would perform against these baselines on unseen benchmarks such as LoCoMo, particularly in a comparable experimental setup.
- The proposed framework requires extensive temporal annotations for training, yet the sensitivity to annotati
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Multimodal Machine Learning Applications
