GrandJury: A Collaborative Machine Learning Model Evaluation Protocol for Dynamic Quality Rubrics
Arthur Cho

TL;DR
GrandJury proposes a novel, dynamic evaluation protocol for generative AI models that accounts for evolving user needs and contextual variability, moving beyond static benchmark tests.
Contribution
It introduces a comprehensive evaluation framework combining time decay, traceability, transparent rubrics, and multi-rater judgment, enabling pluralistic and accountable assessment of AI outputs.
Findings
Supports dynamic, context-aware evaluation of LLMs
Provides open-source tools and datasets for implementation
Captures evolving consensus and disagreement in model assessment
Abstract
Generative machine learning models have become central to modern systems, powering applications in creative writing, summarization, multi-hop reasoning, and context-aware dialogue. These models underpin large-scale AI assistants, workflow automation, and autonomous decision-making. In such domains, an acceptable response is rarely absolute or static; it is plural and highly context-dependent. Yet standard evaluation regimes still rely on static, benchmark-style tests, which incentivize optimization toward leaderboard scores rather than alignment with dynamic user needs or evolving realities. GrandJury introduces a formal evaluation protocol that combines time-decayed aggregation, complete traceability, dynamic and transparent task rubric attribution, and multi-rater human judgment. Together, these elements enable pluralistic, accountable evaluation that captures evolving consensus…
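To make the idea of time-decayed, multi-rater aggregation concrete, here is a minimal sketch. The abstract does not give the paper's actual formula, so everything here is an assumption for illustration: the exponential half-life weighting, the `Vote` record, the `decayed_score` function, and the 30-day default are hypothetical names and choices, not GrandJury's API. Each vote keeps its rater, rubric, and timestamp so that the aggregate stays traceable to its inputs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Vote:
    """One rater's judgment (hypothetical record, not the paper's schema)."""
    rater_id: str      # who judged, kept for traceability
    rubric_id: str     # which rubric version the judgment was made under
    score: float       # assumed to lie in [0, 1]
    cast_at: datetime  # when the judgment was made


def decayed_score(votes, now, half_life_days=30.0):
    """Weighted mean of scores where a vote's weight halves every half_life_days.

    Exponential decay is an assumed weighting, chosen only for illustration.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for vote in votes:
        age_days = (now - vote.cast_at).total_seconds() / 86400.0
        weight = 0.5 ** (age_days / half_life_days)
        weighted_sum += weight * vote.score
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("no votes to aggregate")
    return weighted_sum / weight_total


now = datetime(2025, 1, 31)
votes = [
    Vote("rater-a", "rubric-v2", 1.0, now),                       # fresh vote
    Vote("rater-b", "rubric-v1", 0.0, now - timedelta(days=30)),  # one half-life old
]
current = decayed_score(votes, now)
```

In this sketch the older vote counts half as much as the fresh one, so the aggregate leans toward recent judgments while older ones still contribute. Keeping the raw `Vote` records rather than only the aggregate is what would allow a score to be recomputed or audited later.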
