Scalable and Reliable Evaluation of AI Knowledge Retrieval Systems: RIKER and the Coherent Simulated Universe
JV Roig

TL;DR
This paper introduces RIKER, a contamination-resistant evaluation framework for AI knowledge retrieval systems that generates synthetic documents from known ground truth, enabling deterministic and scalable assessment.
Contribution
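The paper presents RIKER, a benchmark and replicable methodology that inverts the usual evaluation paradigm by generating documents from known ground truth rather than extracting ground truth from documents, allowing scalable, contamination-resistant, and deterministic assessment of knowledge retrieval systems.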
Findings
Claimed context lengths frequently exceed usable capacity, with significant performance degradation beyond 32K tokens.
Cross-document aggregation is more challenging than single-document extraction.
Models can find facts but still hallucinate or fabricate information.
Abstract
Evaluating knowledge systems (LLMs, RAG, knowledge graphs, etc.) faces fundamental challenges: static benchmarks are vulnerable to contamination, LLM-based judges exhibit systematic biases, and ground truth extraction requires expensive human annotation. We present RIKER (Retrieval Intelligence and Knowledge Extraction Rating), both a benchmark and a replicable methodology based on paradigm inversion: generating documents from known ground truth rather than extracting ground truth from documents. This approach enables deterministic scoring and scalable evaluation without human annotation or reference models, and provides contamination resistance through regenerable corpora. Our evaluation of 33 models using over 21 billion tokens reveals that context length claims frequently exceed usable capacity, with significant degradation beyond 32K tokens; cross-document aggregation proves substantially…
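The paradigm inversion described in the abstract can be illustrated with a toy sketch. This is not RIKER's implementation: the fact schema, the templates, and every function name below (`make_ground_truth`, `render_document`, `make_questions`, `score`) are hypothetical, chosen only to show how a seeded ground truth could yield a regenerable corpus, questions with known answers, and scoring that needs no judge model.

```python
import random

# Hypothetical fact schema; the summary does not describe RIKER's real one.
NAMES = ["Avery Lin", "Sam Ortiz", "Dana Novak", "Kim Reyes"]
DEPARTMENTS = ["Finance", "Legal", "Operations"]


def make_ground_truth(seed, n=3):
    """Create the facts first. A new seed regenerates a fresh corpus."""
    rng = random.Random(seed)
    people = rng.sample(NAMES, n)
    return [
        {
            "id": i,
            "name": name,
            "department": rng.choice(DEPARTMENTS),
            "salary": rng.randrange(50_000, 150_000, 1_000),
        }
        for i, name in enumerate(people)
    ]


def render_document(fact):
    """Generate a document FROM a known fact (the inverted direction)."""
    return (
        f"Memo {fact['id']}: {fact['name']} works in "
        f"{fact['department']} and earns {fact['salary']} per year."
    )


def make_questions(facts):
    """Build (question, expected_answer) pairs directly from the facts."""
    questions = [
        (f"Which department does {f['name']} work in?", f["department"])
        for f in facts
    ]
    # Cross-document aggregation: answering requires every memo.
    total = sum(f["salary"] for f in facts)
    questions.append(("What is the combined salary of all employees?", str(total)))
    return questions


def score(questions, answers):
    """Deterministic scoring: normalized exact match against ground truth."""
    correct = sum(
        1
        for (_, expected), given in zip(questions, answers)
        if given.strip().lower() == expected.lower()
    )
    return correct / len(questions)
```

In this sketch, the documents returned by `render_document` would be handed to the system under test, and its answers passed to `score`. Because the expected answers come from the generator rather than from annotators, the same seed always reproduces the same corpus and the same score for the same answers.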
Taxonomy
Topics: Advanced Graph Neural Networks · Topic Modeling · Biomedical Text Mining and Ontologies
