Qworld: Question-Specific Evaluation Criteria for LLMs

Shanghua Gao; Yuchang Su; Pengwei Sui; Curtis Ginder; Marinka Zitnik

arXiv:2603.23522 · cs.CL · March 26, 2026

TL;DR

Qworld introduces a hierarchical, question-specific evaluation method for LLMs that generates detailed criteria tailored to each question, improving the assessment of nuanced capabilities.

Contribution

The paper presents a recursive expansion tree that generates evaluation criteria tailored to each individual question, making LLM assessments more granular and more relevant to the question being asked.

Findings

01. Covers 89% of expert criteria on HealthBench

02. Generates 79% novel, validated criteria

03. Reveals capability differences across models in nuanced dimensions

Abstract

Evaluating large language models (LLMs) on open-ended questions is difficult because response quality depends on the question's context. Binary scores and static rubrics fail to capture these context-dependent requirements. Existing methods define criteria at the dataset level or generate them in a single pass, which limits their ability to explore the evaluation space implied by each question. We introduce One-Question-One-World (Qworld), a method that generates question-specific evaluation criteria using a recursive expansion tree. Given a question, Qworld decomposes it into scenarios, perspectives, and fine-grained binary criteria through structured hierarchical and horizontal expansion. The resulting criteria specify what a high-quality answer must address for that question. On HealthBench, Qworld covers 89% of expert-authored criteria and generates 79% novel criteria validated by…
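The decomposition described in the abstract (question, then scenarios, then perspectives, then binary criteria) can be pictured with a small sketch. This is not the authors' implementation: the `Node` class, the `expand`, `leaf_criteria`, and `score` functions, the fixed four-level order, and the lookup-table expander are all hypothetical names and assumptions made for illustration. A real system would presumably call an LLM where the toy expander is used, and the sketch covers only the hierarchical part of the expansion, not the horizontal one.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """One node of a hypothetical expansion tree."""
    level: str  # "question", "scenario", "perspective", or "criterion"
    text: str
    children: List["Node"] = field(default_factory=list)


# Assumed level order, read off the abstract's description.
LEVELS = ["question", "scenario", "perspective", "criterion"]

# An expander maps (parent text, child level) to child texts.
# Assumption: in practice this would be an LLM call; here it is a plain function.
Expander = Callable[[str, str], List[str]]


def expand(node: Node, expander: Expander) -> Node:
    """Recursively expand a node until the leaves are binary criteria."""
    depth = LEVELS.index(node.level)
    if depth == len(LEVELS) - 1:
        return node  # criteria are leaves
    child_level = LEVELS[depth + 1]
    for text in expander(node.text, child_level):
        node.children.append(expand(Node(child_level, text), expander))
    return node


def leaf_criteria(node: Node) -> List[str]:
    """Collect the criterion texts at the leaves, left to right."""
    if node.level == "criterion":
        return [node.text]
    found: List[str] = []
    for child in node.children:
        found.extend(leaf_criteria(child))
    return found


def score(judgments: Dict[str, bool]) -> float:
    """Fraction of binary criteria judged as met (an assumed aggregation)."""
    return sum(judgments.values()) / len(judgments) if judgments else 0.0


# Toy expander backed by a lookup table with placeholder labels.
TOY = {
    "Q": ["S1", "S2"],
    "S1": ["P1"],
    "S2": ["P2", "P3"],
    "P1": ["C1", "C2"],
    "P2": ["C3"],
    "P3": ["C4"],
}


def toy_expander(text: str, level: str) -> List[str]:
    return TOY.get(text, [])


tree = expand(Node("question", "Q"), toy_expander)
criteria = leaf_criteria(tree)
result = score({"C1": True, "C2": False, "C3": True, "C4": True})
```

Here `criteria` holds the four placeholder leaves and `result` is the share of them marked as met. Averaging binary judgments is only one possible aggregation; the abstract does not say how Qworld combines criteria into a score.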

Code & Models

Datasets

suyc21/qworld
dataset · 32 dl

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Computational and Text Analysis Methods