LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding
Md Faizul Ibne Amin, Yutaka Watanobe, Daniel M. Muepu, Haruto Suzuki, Kenta Nanaumi, and Md Mostafizer Rahman

TL;DR
This paper introduces a reliability-aware, rubric-driven evaluation framework for assessing LLMs as judges and co-creators in AI-assisted coding, emphasizing multi-metric assessment and trajectory analysis.
Contribution
It presents a novel, reproducible evaluation methodology linking LLM judgment reliability with human-AI co-creation dynamics in coding tasks.
Findings
Co-creation success stabilizes early, with high turn-wise success rates.
Inter-judge agreement remains modest, indicating variability in evaluations.
Evaluation scores demonstrate measurable discrimination and decision quality.
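To make "inter-judge agreement" concrete, here is a minimal sketch of one common chance-corrected agreement statistic, Cohen's kappa, applied to two judges' binary verdicts. This summary does not say which agreement statistic the paper uses, so the choice of kappa, the function name `cohen_kappa`, and the toy verdicts are all assumptions made for illustration.

```python
def cohen_kappa(judge_a, judge_b):
    """Cohen's kappa for two equally long lists of categorical verdicts.

    Illustrative only: the paper's actual agreement statistic is not
    stated in this summary, so kappa is an assumed stand-in.
    """
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("verdict lists must be non-empty and equally long")
    n = len(judge_a)
    labels = set(judge_a) | set(judge_b)
    # Observed agreement: fraction of items where both judges match.
    p_observed = sum(x == y for x, y in zip(judge_a, judge_b)) / n
    # Expected agreement if the judges labeled independently at their own rates.
    p_expected = sum(
        (judge_a.count(label) / n) * (judge_b.count(label) / n)
        for label in labels
    )
    if p_expected == 1.0:
        # Both judges used a single identical label; agreement is trivially perfect.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical verdicts (1 = solution accepted, 0 = rejected).
verdicts_a = [1, 1, 0, 0]
verdicts_b = [1, 0, 0, 0]
kappa = cohen_kappa(verdicts_a, verdicts_b)
```

Kappa is 1 for perfect agreement and near 0 when agreement is no better than chance, which is why raw percent agreement alone can overstate how consistent two judges are.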
Abstract
LLMs are increasingly employed both as judges for evaluating open-ended outputs and as co-creation partners in AI-assisted programming; yet rigorous evaluation in human-AI co-creation settings remains underdeveloped, because judgments must be reliable, comparable across models, and interpretable over multi-turn interaction. To address this gap, a rubric-driven LLM-as-a-Judge framework is presented for contest-style human-AI co-creation in coding and software engineering (SE). The framework is built around schema-constrained judge outputs, validation and repair mechanisms, data splits grouped by user and problem to prevent trajectory leakage, and participant-level NONBLIND context. Multiple LLM judges are assessed through a multi-metric protocol covering discrimination (ROC-AUC, PR-AUC), thresholded decision quality (MCC), probabilistic reliability (LogLoss, Brier score, ECE), and inter-judge…
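The metric families named in the abstract can be sketched in plain Python for a binary "success / failure" label and a judge-assigned success probability. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the 0.5 decision threshold, the 10 equal-width ECE bins, and the toy data are all hypothetical. PR-AUC is omitted for brevity.

```python
import math


def roc_auc(y_true, p):
    """Discrimination: probability a positive outranks a negative (ties count 0.5)."""
    pos = [pi for yi, pi in zip(y_true, p) if yi == 1]
    neg = [pi for yi, pi in zip(y_true, p) if yi == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def mcc(y_true, p, threshold=0.5):
    """Thresholded decision quality: Matthews correlation coefficient."""
    tp = fp = tn = fn = 0
    for yi, pi in zip(y_true, p):
        pred = 1 if pi >= threshold else 0
        if pred == 1 and yi == 1:
            tp += 1
        elif pred == 1 and yi == 0:
            fp += 1
        elif pred == 0 and yi == 0:
            tn += 1
        else:
            fn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def log_loss(y_true, p, eps=1e-12):
    """Probabilistic reliability: mean negative log-likelihood (clipped)."""
    total = 0.0
    for yi, pi in zip(y_true, p):
        pi = min(max(pi, eps), 1.0 - eps)
        total -= yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return total / len(y_true)


def brier_score(y_true, p):
    """Probabilistic reliability: mean squared error of the probabilities."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y_true, p)) / len(y_true)


def expected_calibration_error(y_true, p, n_bins=10):
    """ECE over equal-width bins of the predicted positive probability."""
    bins = [[] for _ in range(n_bins)]
    for yi, pi in zip(y_true, p):
        bins[min(int(pi * n_bins), n_bins - 1)].append((yi, pi))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(yi for yi, _ in members) / len(members)
        predicted = sum(pi for _, pi in members) / len(members)
        ece += (len(members) / n) * abs(observed - predicted)
    return ece


# Hypothetical labels (1 = solution succeeded) and judge probabilities.
labels = [1, 0, 1, 0]
probs = [0.9, 0.1, 0.8, 0.3]
report = {
    "roc_auc": roc_auc(labels, probs),
    "mcc": mcc(labels, probs),
    "log_loss": log_loss(labels, probs),
    "brier": brier_score(labels, probs),
    "ece": expected_calibration_error(labels, probs),
}
```

Reporting the three families side by side matters because they can disagree: a judge can rank outcomes well (high ROC-AUC) while its probabilities remain poorly calibrated (high ECE or Brier score).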
