Qworld: Question-Specific Evaluation Criteria for LLMs
Shanghua Gao, Yuchang Su, Pengwei Sui, Curtis Ginder, Marinka Zitnik

TL;DR
Qworld introduces a hierarchical, question-specific evaluation method for LLMs that generates detailed criteria tailored to each question, improving the assessment of nuanced capabilities.
Contribution
The paper presents a recursive expansion tree that generates evaluation criteria tailored to each individual question, making LLM assessments more granular and more relevant to what the question actually asks.
Findings
Covers 89% of expert criteria on HealthBench
Generates 79% novel, validated criteria
Reveals capability differences across models in nuanced dimensions
Abstract
Evaluating large language models (LLMs) on open-ended questions is difficult because response quality depends on the question's context. Binary scores and static rubrics fail to capture these context-dependent requirements. Existing methods define criteria at the dataset level or generate them in a single pass, which limits their ability to explore the evaluation space implied by each question. We introduce One-Question-One-World (Qworld), a method that generates question-specific evaluation criteria using a recursive expansion tree. Given a question, Qworld decomposes it into scenarios, perspectives, and fine-grained binary criteria through structured hierarchical and horizontal expansion. The resulting criteria specify what a high-quality answer must address for that question. On HealthBench, Qworld covers 89% of expert-authored criteria and generates 79% novel criteria validated by…
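As a rough illustration of the decomposition described in the abstract, the sketch below builds a small tree with the levels question, scenario, perspective, and binary criterion. It is not the authors' implementation. The names `Node`, `expand`, `leaf_criteria`, and `score` are hypothetical, as are the `propose` and `judge` callables, which stand in for whatever LLM calls the real system makes. Reading "hierarchical" expansion as descending one level and "horizontal" expansion as adding siblings at the same level is also an assumption, and so is scoring an answer as the fraction of criteria it satisfies.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Assumed level names, taken from the abstract's wording.
LEVELS = ["question", "scenario", "perspective", "criterion"]


@dataclass
class Node:
    level: str
    text: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical stand-in for an LLM call: given a parent node and the texts of
# its existing children, return proposed texts for additional children.
Proposer = Callable[[Node, List[str]], List[str]]
# Hypothetical stand-in for a grader: does the answer satisfy this criterion?
Judge = Callable[[str, str], bool]


def expand(node: Node, propose: Proposer, horizontal_rounds: int = 2) -> Node:
    """Recursively grow the tree below `node`."""
    depth = LEVELS.index(node.level)
    if depth == len(LEVELS) - 1:
        return node  # criteria are leaves
    child_level = LEVELS[depth + 1]

    # Horizontal expansion (assumed meaning): repeatedly ask for more
    # siblings, showing the proposer what already exists.
    for _ in range(horizontal_rounds):
        existing = [child.text for child in node.children]
        proposed = propose(node, existing)
        fresh = [t for t in dict.fromkeys(proposed) if t not in existing]
        if not fresh:
            break
        node.children.extend(Node(child_level, t) for t in fresh)

    # Hierarchical expansion (assumed meaning): descend into each child.
    for child in node.children:
        expand(child, propose, horizontal_rounds)
    return node


def leaf_criteria(node: Node) -> List[str]:
    """Collect the binary criteria at the leaves of the tree."""
    if node.level == "criterion":
        return [node.text]
    return [c for child in node.children for c in leaf_criteria(child)]


def score(answer: str, criteria: List[str], judge: Judge) -> float:
    """Fraction of binary criteria the answer satisfies (assumed aggregation)."""
    if not criteria:
        return 0.0
    return sum(1 for c in criteria if judge(answer, c)) / len(criteria)
```

In use, `propose` would wrap a prompt asking for further scenarios, perspectives, or criteria under the given parent, and `judge` would wrap a yes/no grading prompt. Passing the existing siblings to `propose` lets later rounds aim for items not yet covered.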
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Computational and Text Analysis Methods
