OKBench: Democratizing LLM Evaluation with Fully Automated, On-Demand, Open Knowledge Benchmarking
Yanhong Li, Tianyang Xu, Kenan Tang, Karen Livescu, David McAllester, Jiawei Zhou

TL;DR
OKBench introduces an automated, dynamic benchmarking framework for evaluating LLMs on evolving knowledge, especially in the news domain, enabling more realistic and up-to-date assessments of model capabilities.
Contribution
This work presents a fully automated, on-demand benchmark generation system that captures evolving knowledge, democratizes evaluation, and enhances assessment of retrieval-augmented LLMs.
Findings
Retrieval narrows performance gaps between small and large models.
Models exhibit distinct behaviors when faced with new information.
OKBench enables evaluation on up-to-date, dynamic knowledge datasets.
Abstract
Knowledge-intensive question answering is central to large language models (LLMs) and is typically assessed using static benchmarks derived from sources like Wikipedia and textbooks. However, these benchmarks fail to capture evolving knowledge in a dynamic world, and centralized curation struggles to keep pace with rapid LLM advancements. To address these drawbacks, we propose Open Knowledge Bench (OKBench), a fully automated framework for generating high-quality, dynamic knowledge benchmarks on demand. Focusing on the news domain, where knowledge updates daily, OKBench is an agentic framework that automates the sourcing, creation, validation, and distribution of benchmarks. Our approach democratizes benchmark creation and facilitates thorough evaluation of retrieval-augmented methods by reducing overlap with pretraining data. We evaluate our framework on a wide range of open-source and…
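The four stages named in the abstract (sourcing, creation, validation, distribution) can be pictured as a small pipeline. The sketch below is illustrative only and is not the authors' implementation: every name in it (`Article`, `QAItem`, `source_articles`, `create_items`, `validate_items`, `to_jsonl`, `build_benchmark`, `toy_generate`) is hypothetical, the question writer is a toy stand-in for whatever model the real system calls, and the validation rule (the answer must appear verbatim in the source article) is an assumed simplification.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Article:
    title: str
    body: str
    published: str  # ISO date, e.g. "2025-01-02"


@dataclass(frozen=True)
class QAItem:
    question: str
    answer: str
    source_title: str


def source_articles(feed: List[Article], since: str) -> List[Article]:
    """Sourcing: keep articles published on or after `since`.
    ISO dates in YYYY-MM-DD form compare correctly as strings."""
    return [a for a in feed if a.published >= since]


def create_items(articles: List[Article],
                 generate: Callable[[Article], QAItem]) -> List[QAItem]:
    """Creation: `generate` is a placeholder for a question writer."""
    return [generate(a) for a in articles]


def validate_items(items: List[QAItem], articles: List[Article]) -> List[QAItem]:
    """Validation (assumed rule): the answer must occur in the source body."""
    bodies = {a.title: a.body for a in articles}
    return [q for q in items
            if q.answer and q.answer in bodies.get(q.source_title, "")]


def to_jsonl(items: List[QAItem]) -> str:
    """Distribution: serialize the benchmark as JSON Lines."""
    return "\n".join(json.dumps(asdict(q)) for q in items)


def build_benchmark(feed: List[Article], since: str,
                    generate: Callable[[Article], QAItem]) -> List[QAItem]:
    """Run sourcing, creation, and validation in order."""
    articles = source_articles(feed, since)
    return validate_items(create_items(articles, generate), articles)


def toy_generate(article: Article) -> QAItem:
    """Hypothetical stand-in generator: asks for the first word of the body."""
    first = article.body.split()[0]
    return QAItem(f"What is the first word of '{article.title}'?",
                  first, article.title)
```

Filtering on publication date is what lets such a benchmark target material newer than a model's training data, which is the property the abstract ties to evaluating retrieval-augmented methods.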
Taxonomy
Topics: Topic Modeling · Expert finding and Q&A systems · Computational and Text Analysis Methods
