Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements

Yiming Liang; Yizhi Li; Yantao Du; Ge Zhang; Jiayi Zhou; Yuchen Wu; Yinzhu Piao; Denghui Cao; Tong Sun; Ziniu Li; Li Du; Bo Lei; Jiaheng Liu; Chenghua Lin; Zhaoxiang Zhang; Wenhao Huang; Jiajun Zhang

arXiv:2512.24867 · cs.CL · January 7, 2026


Open Access · 2 Datasets

TL;DR

Encyclo-K introduces a dynamic, statement-based benchmark for evaluating LLMs, addressing limitations of question-based benchmarks by enabling scalable, multi-knowledge assessment with reduced annotation costs.

Contribution

The paper proposes a statement-based benchmark that dynamically composes evaluation questions from knowledge statements drawn from authoritative textbooks, improving scalability and breadth of coverage.

Findings

01

Top LLMs achieve only around 62% accuracy on Encyclo-K.

02

Performance varies significantly from one model to another.

03

Encyclo-K effectively challenges models with multi-statement, comprehensive questions.

Abstract

Benchmarks play a crucial role in tracking the rapid advancement of large language models (LLMs) and identifying their capability boundaries. However, existing benchmarks predominantly curate questions at the question level, suffering from three fundamental limitations: vulnerability to data contamination, restriction to single-knowledge-point assessment, and reliance on costly domain expert annotation. We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up. Our key insight is that knowledge statements, not questions, can serve as the unit of curation, and questions can then be constructed from them. We extract standalone knowledge statements from authoritative textbooks and dynamically compose them into evaluation questions through random sampling at test time. This design directly addresses all three limitations: the combinatorial…
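The composition step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Statement` class, its `is_correct` label, the `compose_question` helper, and the "which statements are correct" prompt format are hypothetical and are not taken from the paper, which may use a different schema and question format.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    # Hypothetical schema: the abstract does not specify how statements are stored.
    text: str
    is_correct: bool


def compose_question(pool, k, rng):
    """Sample k distinct statements from the pool and build one question.

    Returns (prompt, answer), where answer is the set of option labels
    whose statements are marked correct. Supports up to 26 options.
    """
    if not 1 <= k <= min(len(pool), 26):
        raise ValueError("k must be between 1 and min(len(pool), 26)")
    sampled = rng.sample(pool, k)  # random sampling without replacement
    labels = [chr(ord("A") + i) for i in range(k)]
    lines = [f"{label}. {s.text}" for label, s in zip(labels, sampled)]
    prompt = "Which of the following statements are correct?\n" + "\n".join(lines)
    answer = {label for label, s in zip(labels, sampled) if s.is_correct}
    return prompt, answer


# Toy pool for illustration only; a real pool would come from textbooks.
POOL = [
    Statement("The Earth orbits the Sun.", True),
    Statement("A triangle has four sides.", False),
    Statement("Water boils at 100 °C at standard atmospheric pressure.", True),
    Statement("Sound travels faster in a vacuum than in air.", False),
]

if __name__ == "__main__":
    prompt, answer = compose_question(POOL, k=3, rng=random.Random(0))
    print(prompt)
    print("Answer key:", sorted(answer))
```

Because each question is a random subset of the pool, a pool of n statements yields C(n, k) distinct k-statement subsets, so the question set is generated at test time rather than fixed in advance. Passing a seeded `random.Random` keeps a given evaluation run reproducible.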


Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Computational and Text Analysis Methods