ChronoPlay: A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks

Liyang He; Yuren Zhang; Ziwei Zhu; Zhenghui Li; Shiwei Tong

arXiv:2510.18455·cs.CL·October 22, 2025

TL;DR

ChronoPlay introduces a novel framework for creating dynamic, authentic game RAG benchmarks by modeling dual content and community dynamics, enabling more realistic evaluation of retrieval-augmented systems in gaming.

Contribution

It presents the first automated framework for continuous, dual-dynamic game RAG benchmarks, integrating content updates and community authenticity.

Findings

01

First dynamic RAG benchmark for gaming created

02

Framework effectively tracks content and community changes

03

Benchmark reveals model performance under realistic conditions

Abstract

Retrieval Augmented Generation (RAG) systems are increasingly vital in dynamic domains like online gaming, yet the lack of a dedicated benchmark has impeded standardized evaluation in this area. The core difficulty lies in Dual Dynamics: the constant interplay between game content updates and the shifting focus of the player community. Furthermore, the necessity of automating such a benchmark introduces a critical requirement for player-centric authenticity to ensure generated questions are realistic. To address this integrated challenge, we introduce ChronoPlay, a novel framework for the automated and continuous generation of game RAG benchmarks. ChronoPlay utilizes a dual-dynamic update mechanism to track both forms of change, and a dual-source synthesis engine that draws from official sources and player community to ensure both factual correctness and authentic query patterns. We…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. ChronoPlay is the first framework to incorporate dynamic, evolving environments into RAG evaluation for gaming. By formalizing knowledge evolution and user interest drift, it highlights two important factors often overlooked in static RAG benchmarks.
2. The experiments are extensive, covering several retrievers and generators across three games and including ablations on knowledge and interest dynamics.
3. The proposed datasets and methodology could help the community study how RAG systems

Weaknesses

1. While the paper includes a diverse set of retrievers and generators, the RAG experiments themselves are static. It is unclear whether the RAG system re-indexes as the benchmark evolves. No adaptive RAG system (e.g. with fine-tuning or memory) is discussed, so it is unclear how the proposed dynamic benchmark would challenge or benefit adaptive RAG systems in practice.
2. The paper also omits details about how the vector databases are indexed or re-indexed after each phase. It is unclear whether e

Reviewer 02 · Rating 6 · Confidence 2

Strengths

- The concept of "Dual Dynamics" is a powerful contribution. The paper addresses a limitation of existing dynamic benchmarks and provides a more realistic evaluation paradigm.
- The proposed framework is well designed. Human expert evaluation, together with validation of the LLM-as-judge against human experts, lends credibility. The framework is also instantiated on multiple games, which shows diversity.

Weaknesses

- The ChronoPlay pipeline involves multiple LLM-driven stages and several hyperparameters (λ_JSD=0.001, γ=1.5, varying window sizes W). This complexity might pose barriers to easy adoption.
- The statistical rigor of the evaluation could be strengthened. The paper lacks confidence intervals or significance tests for performance differences. Are the fluctuations statistically significant?
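
As one illustration of the kind of check Reviewer 02 asks for, a paired percentile bootstrap can put a confidence interval on the score difference between two systems evaluated on the same questions. This is a minimal sketch, not something taken from the paper: the function name `paired_bootstrap_ci`, its parameters, and the example scores are all hypothetical, and the paper's own metrics and phase structure may call for a different test.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a - scores_b) over paired items.

    Hypothetical helper for illustration; assumes both lists hold per-question
    scores for the same questions in the same order.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample the per-question differences with replacement and record each mean.
    boot_means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = boot_means[int((alpha / 2) * n_resamples)]
    hi = boot_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lo, hi


if __name__ == "__main__":
    # Made-up correctness scores (1 = judged correct) for two systems.
    system_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
    system_b = [0, 1, 0, 0, 1, 0, 0, 1, 0, 1]
    diff, lo, hi = paired_bootstrap_ci(system_a, system_b)
    print(f"mean difference {diff:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

If the resulting interval excludes zero, the difference is unlikely to be resampling noise at the chosen level. Comparing scores across benchmark phases, where the question sets differ, would need an unpaired variant instead.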

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. This paper studies RAG problems in dynamic domains and builds a benchmark in this area with an automated, continuous generation framework. The research topic is interesting and valuable.
2. The data are sourced from real games and player communities, which makes the benchmark more realistic.

Weaknesses

1. Many real-world applications require dynamic RAG systems, including online shopping (where prices and promotional campaigns constantly change) and travel planning (where weather conditions and seasonal attractions vary). However, the benchmark's coverage of only three games limits its scope and makes the data domain insufficiently diverse.
2. The generation performance in this work is evaluated using LLM-as-Judge, as detailed in Appendix C. However, the meta-evaluation results (Section C.3)

Code & Models

Datasets

leoner24/ChronoPlay-QA
dataset· 59 dl

Taxonomy

Topics: Artificial Intelligence in Games · Educational Games and Gamification · Advanced Graph Neural Networks