GrandJury: A Collaborative Machine Learning Model Evaluation Protocol for Dynamic Quality Rubrics

Arthur Cho

arXiv:2508.02926 · cs.LG · August 8, 2025


TL;DR

GrandJury proposes a novel, dynamic evaluation protocol for generative AI models that accounts for evolving user needs and contextual variability, moving beyond static benchmark tests.

Contribution

GrandJury introduces an evaluation framework that combines time-decayed aggregation, traceability, transparent rubrics, and multi-rater judgment, enabling pluralistic and accountable assessment of AI outputs.

Findings

01 Supports dynamic, context-aware evaluation of LLMs

02 Provides open-source tools and datasets for implementation

03 Captures evolving consensus and disagreement in model assessment

Abstract

Generative machine learning models have become central to modern systems, powering applications in creative writing, summarization, multi-hop reasoning, and context-aware dialogue. These models underpin large-scale AI assistants, workflow automation, and autonomous decision-making. In such domains, an acceptable response is rarely absolute or static; it is plural and highly context-dependent. Yet standard evaluation regimes still rely on static, benchmark-style tests, incentivizing optimization toward leaderboard scores rather than alignment with dynamic user needs or evolving realities. GrandJury introduces a formal evaluation protocol that combines time-decayed aggregation, complete traceability, support for dynamic and transparent task-rubric attribution, and multi-rater human judgment. Together, these elements enable pluralistic, accountable evaluation that captures evolving consensus…
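To make "time-decayed aggregation" of multi-rater judgments concrete, here is a minimal sketch. It is not the paper's method: the abstract does not specify the decay function or the data schema, so this assumes exponential decay with a half-life, scores in [0, 1], and grouping by (output, rubric). The names `Verdict`, `decay_weight`, `aggregate`, and `half_life_seconds` are hypothetical. The weighted mean stands in for "consensus" and the weighted standard deviation for "disagreement"; returning the contributing rater IDs is one simple way to keep each aggregate traceable back to the raw judgments.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One traceable human judgment (hypothetical schema, not from the paper)."""
    rater_id: str
    output_id: str
    rubric: str       # rubric criterion the score was given under
    score: float      # assumed range: 0.0 (bad) to 1.0 (good)
    timestamp: float  # seconds since epoch


def decay_weight(age_seconds: float, half_life_seconds: float) -> float:
    """Assumed exponential decay: a verdict's weight halves every half-life."""
    return 0.5 ** (max(age_seconds, 0.0) / half_life_seconds)


def aggregate(
    verdicts: list[Verdict], now: float, half_life_seconds: float
) -> dict[tuple[str, str], tuple[float, float, tuple[str, ...]]]:
    """Map (output_id, rubric) -> (consensus, disagreement, rater_ids)."""
    groups: dict[tuple[str, str], list[Verdict]] = defaultdict(list)
    for v in verdicts:
        groups[(v.output_id, v.rubric)].append(v)

    result = {}
    for key, vs in groups.items():
        weights = [decay_weight(now - v.timestamp, half_life_seconds) for v in vs]
        total = sum(weights)
        # Consensus: decay-weighted mean score, so recent verdicts count more.
        mean = sum(w * v.score for w, v in zip(weights, vs)) / total
        # Disagreement: decay-weighted standard deviation around that mean.
        var = sum(w * (v.score - mean) ** 2 for w, v in zip(weights, vs)) / total
        # Traceability: keep which raters contributed to this aggregate.
        raters = tuple(sorted(v.rater_id for v in vs))
        result[key] = (mean, math.sqrt(var), raters)
    return result
```

For example, with a 30-day half-life, a positive verdict from 30 days ago carries weight 0.5, while two fresh verdicts (one positive, one negative) each carry weight 1.0, giving a consensus of 0.6 instead of the unweighted 0.67. Exponential decay is chosen here only because it is smooth and has a single interpretable parameter; a sliding window or another schedule would fit the same structure.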
