ChronoPlay: A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks
Liyang He, Yuren Zhang, Ziwei Zhu, Zhenghui Li, Shiwei Tong

TL;DR
ChronoPlay introduces a novel framework for creating dynamic, authentic game RAG benchmarks by modeling dual content and community dynamics, enabling more realistic evaluation of retrieval-augmented systems in gaming.
Contribution
It presents the first automated framework for continuous, dual-dynamic game RAG benchmarks, integrating content updates and community authenticity.
Findings
First dynamic RAG benchmark for gaming created
Framework effectively tracks content and community changes
Benchmark reveals model performance under realistic conditions
Abstract
Retrieval Augmented Generation (RAG) systems are increasingly vital in dynamic domains like online gaming, yet the lack of a dedicated benchmark has impeded standardized evaluation in this area. The core difficulty lies in Dual Dynamics: the constant interplay between game content updates and the shifting focus of the player community. Furthermore, the necessity of automating such a benchmark introduces a critical requirement for player-centric authenticity to ensure generated questions are realistic. To address this integrated challenge, we introduce ChronoPlay, a novel framework for the automated and continuous generation of game RAG benchmarks. ChronoPlay utilizes a dual-dynamic update mechanism to track both forms of change, and a dual-source synthesis engine that draws from official sources and the player community to ensure both factual correctness and authentic query patterns. We…
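The "shifting focus of the player community" can be made concrete with a small sketch. The reviews below mention a hyperparameter λ_JSD, which suggests that Jensen-Shannon divergence plays some role in the pipeline, but this page does not say how. The code assumes, purely for illustration, that each player question carries a topic label and that drift is flagged when the divergence between two adjacent time windows exceeds a threshold. The function names, the topic labels, and the thresholding rule are all hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def topic_distribution(labels, topics):
    """Relative frequency of each topic in one time window (hypothetical labels)."""
    if not labels:
        raise ValueError("window must contain at least one labeled question")
    counts = Counter(labels)
    return [counts[t] / len(labels) for t in topics]


def jensen_shannon_divergence(p, q):
    """JSD with base-2 logs, so the value lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def interest_drifted(prev_labels, curr_labels, topics, threshold):
    """Assumed rule: flag drift when the JSD between windows exceeds the threshold."""
    p = topic_distribution(prev_labels, topics)
    q = topic_distribution(curr_labels, topics)
    return jensen_shannon_divergence(p, q) > threshold


# Illustrative data: community attention moves from builds to quests.
topics = ["build", "quest"]
before = ["build"] * 8 + ["quest"] * 2
after = ["build"] * 2 + ["quest"] * 8
print(interest_drifted(before, after, topics, threshold=0.001))
```

A threshold as small as 0.001 makes such a rule sensitive to sampling noise in small windows, so in practice the window size and the threshold would need to be tuned together.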
Peer Reviews
Decision·ICLR 2026 Poster
1. ChronoPlay is the first framework to incorporate dynamic, evolving environments into RAG evaluation for gaming. By formalizing knowledge evolution and user interest drift, it highlights two important factors often overlooked in static RAG benchmarks. 2. The experiments are extensive, covering several retrievers and generators across three games and including ablations on knowledge and interest dynamics. 3. The proposed datasets and methodology could help the community study how RAG systems
1. While the paper includes a diverse set of retrievers and generators, the RAG experiments themselves are static. It's unclear whether the RAG system re-indexes as the benchmark evolves. No adaptive RAG system (e.g., with fine-tuning or memory) is discussed. It's unclear how the proposed dynamic benchmark would challenge or benefit adaptive RAG systems in practice. 2. The paper also omits details about how the vector databases are indexed or re-indexed after each phase. It is unclear whether e
- The concept of "Dual Dynamics" is a powerful contribution. The paper addresses a limitation of existing dynamic benchmarks and provides a more realistic evaluation paradigm. - The framework this paper proposes is well designed. Human expert evaluation and validation of LLM-as-judge against human experts provide credibility. It is also instantiated on multiple games, which shows diversity.
- The ChronoPlay pipeline involves multiple LLM-driven stages and several hyperparameters (λ_JSD=0.001, γ=1.5, varying window sizes W). This complexity might pose barriers to easy adoption. - The statistical rigor of the evaluation could be strengthened. The paper lacks confidence intervals or significance tests for performance differences. Are the fluctuations statistically significant?
1. This paper studies RAG problems in dynamic domains and builds a benchmark in this area with an automated and continuous generation framework. The research topic is interesting and valuable. 2. The data is sourced from real games and player communities, making the benchmark more realistic.
1. Many real-world applications require dynamic RAG systems, including online shopping (where prices and promotional campaigns constantly change) and travel planning (where weather conditions and seasonal attractions vary). However, the benchmark's coverage of only three games limits its scope and makes the data domain insufficiently diverse. 2. The generation performance in this work is evaluated using LLM-as-Judge, as detailed in Appendix C. However, the meta-evaluation results (Section C.3)
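One reviewer above asks for confidence intervals or significance tests on performance differences. A paired bootstrap is one common way to obtain them; the sketch below is a generic illustration, not the paper's evaluation procedure. It assumes that two systems were scored on the same questions and that per-question scores are available as numbers. The function name and the example scores are hypothetical.

```python
import random
import statistics


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-question difference (A minus B)."""
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each A/B pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(sample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return statistics.fmean(diffs), lower, upper


# Made-up per-question scores for two systems on the same ten questions.
system_a = [0.9, 0.7, 0.8, 0.6, 0.9, 0.5, 0.7, 0.8, 0.6, 0.9]
system_b = [0.8, 0.7, 0.6, 0.6, 0.7, 0.5, 0.6, 0.8, 0.5, 0.7]
mean_diff, lower, upper = paired_bootstrap_ci(system_a, system_b)
print(f"mean difference {mean_diff:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

If the interval excludes zero, the difference between the two systems is unlikely to be a sampling artifact at the chosen level. Running the same check per benchmark phase would indicate whether the fluctuations over time are larger than the noise.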
Taxonomy
Topics: Artificial Intelligence in Games · Educational Games and Gamification · Advanced Graph Neural Networks
