LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

Md Faizul Ibne Amin, Yutaka Watanobe, Daniel M. Muepu, Haruto Suzuki, Kenta Nanaumi, and Md Mostafizer Rahman

arXiv:2604.27727·cs.SE·May 1, 2026


TL;DR

This paper introduces a reliability-aware, rubric-driven evaluation framework for assessing LLMs as judges and co-creators in AI-assisted coding, emphasizing multi-metric assessment and trajectory analysis.

Contribution

It presents a novel, reproducible evaluation methodology linking LLM judgment reliability with human-AI co-creation dynamics in coding tasks.

Findings

01 Co-creation success stabilizes early, with high turn-wise success rates.

02 Inter-judge agreement remains modest, indicating variability in evaluations.

03 Evaluation scores demonstrate measurable discrimination and decision quality.

Abstract

LLMs are increasingly employed both as judges for evaluating open-ended outputs and as co-creation partners in AI-assisted programming. Yet rigorous evaluation in human-AI co-creation settings remains underdeveloped, because judgments must be reliable, comparable across models, and interpretable over multi-turn interaction. To address this gap, a rubric-driven LLM-as-a-Judge framework is presented for contest-style human-AI co-creation in coding and software engineering (SE). The framework is built around schema-constrained judge outputs, validation and repair mechanisms, data splits grouped by user and problem to prevent trajectory leakage, and participant-level NONBLIND context. Multiple LLM judges are assessed through a multi-metric protocol covering discrimination (ROC-AUC, PR-AUC), thresholded decision quality (MCC), probabilistic reliability (LogLoss, Brier score, ECE), and inter-judge…
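To make the metric families named in the abstract concrete, the following is a minimal sketch of how several of them could be computed for one judge, assuming binary ground-truth labels (1 = success) and a judge that emits a success probability per item. The function names, the 0.5 threshold, the 10 equal-width bins, and the toy data are illustrative assumptions, not the paper's implementation. The ECE shown is the common binned variant for binary outcomes.

```python
import math

def roc_auc(y_true, p):
    # Probability that a random positive is scored above a random negative
    # (ties count as half). Assumes both classes are present.
    pos = [pi for yi, pi in zip(y_true, p) if yi == 1]
    neg = [pi for yi, pi in zip(y_true, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def mcc(y_true, p, threshold=0.5):
    # Matthews correlation coefficient after thresholding the probabilities.
    tp = fp = tn = fn = 0
    for yi, pi in zip(y_true, p):
        pred = 1 if pi >= threshold else 0
        if pred == 1 and yi == 1:
            tp += 1
        elif pred == 1 and yi == 0:
            fp += 1
        elif pred == 0 and yi == 0:
            tn += 1
        else:
            fn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def log_loss(y_true, p, eps=1e-15):
    # Mean negative log-likelihood; probabilities are clipped to avoid log(0).
    total = 0.0
    for yi, pi in zip(y_true, p):
        pi = min(max(pi, eps), 1 - eps)
        total -= yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return total / len(y_true)

def brier_score(y_true, p):
    # Mean squared error between predicted probability and the 0/1 outcome.
    return sum((pi - yi) ** 2 for yi, pi in zip(y_true, p)) / len(y_true)

def ece(y_true, p, n_bins=10):
    # Expected calibration error: weighted gap between the mean predicted
    # probability and the observed positive rate within each bin.
    bins = [[] for _ in range(n_bins)]
    for yi, pi in zip(y_true, p):
        bins[min(int(pi * n_bins), n_bins - 1)].append((yi, pi))
    n = len(y_true)
    total = 0.0
    for b in bins:
        if b:
            observed = sum(yi for yi, _ in b) / len(b)
            predicted = sum(pi for _, pi in b) / len(b)
            total += len(b) / n * abs(observed - predicted)
    return total

# Toy data (hypothetical): four items, labels and one judge's probabilities.
labels = [1, 0, 1, 0]
probs = [0.9, 0.1, 0.8, 0.3]
report = {
    "roc_auc": roc_auc(labels, probs),
    "mcc": mcc(labels, probs),
    "log_loss": log_loss(labels, probs),
    "brier": brier_score(labels, probs),
    "ece": ece(labels, probs),
}
print(report)
```

Reporting these side by side matters because they can disagree: a judge may rank items well (high ROC-AUC) while still being poorly calibrated (high ECE or Brier score).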
