SimEval-IR: A Unified Toolkit and Benchmark Suite for Evaluating User Simulators and Search Sessions
Saber Zerhoudi

TL;DR
SimEval-IR is an open-source toolkit and benchmark suite that standardizes evaluation of user simulators in information retrieval, distinguishing behavioral realism from tester reliability, and providing concrete metrics and baseline results.
Contribution
SimEval-IR introduces a unified session schema, three executable benchmarks, and baseline results, addressing the lack of standardized evaluation tools for user simulators in IR.
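As a rough illustration of what a unified session schema with loss accounting might look like, here is a minimal sketch. Every name in it (`Turn`, `Session`, `adapt_query_log`, `dropped_fields`, and the raw-log keys) is hypothetical and not taken from SimEval-IR; it only shows the idea of mapping a raw log into one shared structure while recording which source fields could not be carried over.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One user step: a query (search) or an utterance (conversation)."""
    text: str
    shown: list[str] = field(default_factory=list)    # ranked item ids shown to the user
    clicked: list[str] = field(default_factory=list)  # ids the user clicked


@dataclass
class Session:
    """Hypothetical canonical session record shared by all datasets."""
    session_id: str
    kind: str                                          # "search" or "conversation"
    turns: list[Turn] = field(default_factory=list)
    dropped_fields: list[str] = field(default_factory=list)  # loss accounting


def adapt_query_log(session_id, rows):
    """Hypothetical adapter: convert raw query-log rows (dicts) into a Session.

    Keys the schema cannot represent are not silently discarded; their names
    are recorded in `dropped_fields`.
    """
    known = {"query", "results", "clicks"}
    session = Session(session_id=session_id, kind="search")
    lost = set()
    for row in rows:
        lost |= set(row) - known
        session.turns.append(
            Turn(
                text=row["query"],
                shown=list(row.get("results", [])),
                clicked=list(row.get("clicks", [])),
            )
        )
    session.dropped_fields = sorted(lost)
    return session
```

For example, adapting a row that carries an extra `dwell_ms` key would yield a session whose `dropped_fields` lists `dwell_ms`, making the information loss explicit rather than hidden.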
Findings
A classifier-discriminator 'human-likeness' check has low predictive power for system ranking.
Distance metrics, such as those based on click depth and the Fréchet distance between session embeddings, are more effective than the discriminator check.
SimEval-IR is publicly available with configurations and scripts for reproducibility.
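To make the notion of a click-depth distance concrete, the sketch below compares the click-depth distributions of real and simulated sessions using the 1-Wasserstein distance between two empirical samples. This is an assumption made for illustration: the summary does not say which distance the toolkit applies to click depth, and the function name `click_depth_distance` is hypothetical. Click depth is taken here to be the deepest rank clicked for a query.

```python
from bisect import bisect_right


def click_depth_distance(real, simulated):
    """1-Wasserstein distance between two samples of click depths.

    For one-dimensional data this equals the area between the two
    empirical cumulative distribution functions.
    """
    if not real or not simulated:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(real), sorted(simulated)
    points = sorted(set(a) | set(b))
    total = 0.0
    # Both CDFs are constant between consecutive observed values,
    # so the area is a sum of rectangles.
    for left, right in zip(points, points[1:]):
        cdf_a = bisect_right(a, left) / len(a)
        cdf_b = bisect_right(b, left) / len(b)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total
```

A value of 0 means the two samples have identical click-depth distributions; a simulator that always clicks one rank deeper than real users would score 1.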
Abstract
User simulators are increasingly central to interactive information retrieval, yet the community lacks standardized evaluation tools. Simulators serve two objectives, behavioral realism (matching real user behavior) and tester reliability (producing valid system rankings), and these are often conflated despite being distinct and sometimes conflicting. We present SimEval-IR, an open-source toolkit and benchmark suite that makes this distinction measurable. SimEval-IR provides: (1) a canonical session schema unifying session search and conversational interactions, with validated dataset adapters and explicit loss accounting; (2) three executable benchmarks covering behavioral realism, tester reliability with RATE-style estimation, and an analysis linking the two; and (3) baseline results across four real datasets in two languages and four simulator families. Our key finding: the…
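Tester reliability, as described above, concerns whether a simulator ranks retrieval systems the way real users would. One common way to quantify agreement between two rankings is Kendall's tau; the sketch below computes the tau-a variant from per-system scores. This is not the RATE-style estimation the abstract mentions, only a plain rank correlation used as an illustration, and the function name and example scores are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_ref, scores_sim):
    """Kendall's tau-a between two system rankings.

    Each argument maps a system name to its score (higher is better).
    Only systems present in both mappings are compared. Tied pairs count
    as neither concordant nor discordant.
    """
    systems = sorted(set(scores_ref) & set(scores_sim))
    if len(systems) < 2:
        raise ValueError("need at least two systems in common")
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        product = (scores_ref[s] - scores_ref[t]) * (scores_sim[s] - scores_sim[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores: the simulator swaps the order of systems B and C.
real_user_scores = {"A": 0.5, "B": 0.4, "C": 0.3}
simulator_scores = {"A": 0.5, "B": 0.3, "C": 0.4}
tau = kendall_tau(real_user_scores, simulator_scores)
```

A tau of 1 means the simulator reproduces the reference ranking exactly, and -1 means it reverses it. In the example, two of the three system pairs agree and one disagrees, so tau is one third.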
