QQJ: Quantifying Qualitative Judgment for Scalable and Human-Aligned Evaluation of Generative AI
Marjan Veysi, Pirooz Shamsinejadbabaki, Mohammad Zare, Mohammad Sabouri

TL;DR
QQJ is a scalable evaluation framework for generative AI that aligns automated assessments with human judgment by using expert-designed rubrics and calibrated language models.
Contribution
Introduces QQJ, a novel evaluation method that bridges human judgment and automation through structured rubrics and calibration, improving reliability and interpretability.
Findings
QQJ outperforms traditional metrics in aligning with human judgment.
QQJ shows better stability and diagnostic capability across tasks.
QQJ effectively identifies critical failure modes like hallucination.
Abstract
The rapid progress of generative artificial intelligence has exposed fundamental limitations in existing evaluation methodologies, particularly for open-ended, creative, and human-facing tasks. Traditional automatic metrics rely on surface-level statistical similarity and often fail to reflect human perceptions of quality, while purely human evaluation, although reliable, is costly, subjective, and difficult to scale. Recent approaches using large language models as evaluators offer improved scalability but frequently lack explicit grounding in human-defined evaluation principles, leading to bias and inconsistency. In this paper, we introduce Quantifying Qualitative Judgment (QQJ), a scalable and human-centric evaluation framework that explicitly bridges the gap between human judgment and automated assessment. QQJ separates the definition of quality from its execution by anchoring…
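As a rough illustration of what separating the definition of quality from its execution could look like in practice, the sketch below combines per-criterion scores under a weighted rubric and then fits a linear map from automated scores to human ratings. It is not the authors' implementation: the names `Criterion`, `rubric_score`, and `fit_linear_calibration`, the example criteria and weights, and the choice of least-squares linear calibration are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric entry (hypothetical structure, not taken from the paper)."""
    name: str
    weight: float


def rubric_score(criterion_scores, rubric):
    """Weighted mean of per-criterion scores produced by some evaluator."""
    total_weight = sum(c.weight for c in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    return sum(c.weight * criterion_scores[c.name] for c in rubric) / total_weight


def fit_linear_calibration(auto_scores, human_scores):
    """Least-squares fit of human ~ slope * auto + intercept.

    Assumed calibration scheme; the paper may calibrate differently.
    """
    n = len(auto_scores)
    if n == 0 or n != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    mean_a = sum(auto_scores) / n
    mean_h = sum(human_scores) / n
    var_a = sum((a - mean_a) ** 2 for a in auto_scores)
    if var_a == 0:
        raise ValueError("automated scores have no variance")
    cov = sum((a - mean_a) * (h - mean_h)
              for a, h in zip(auto_scores, human_scores))
    slope = cov / var_a
    return slope, mean_h - slope * mean_a


# Hypothetical rubric: factuality counts twice as much as fluency.
rubric = [Criterion("factuality", 2.0), Criterion("fluency", 1.0)]
raw = rubric_score({"factuality": 4.0, "fluency": 1.0}, rubric)

# Toy calibration data, invented purely for this example.
slope, intercept = fit_linear_calibration([1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
calibrated = slope * raw + intercept
```

Keeping the rubric as plain data means experts can revise criteria and weights without touching the scoring code, which is one way to read the separation the abstract describes.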
