Criterion Validity of LLM-as-Judge for Business Outcomes in Conversational Commerce
Liang Chen, Qi Liu, Wenhuan Lin, Feng Liang

TL;DR
This study tests whether multi-dimensional, rubric-based dialogue assessments produced by an LLM judge predict business outcomes in conversational commerce. It finds that predictive value differs by rubric dimension and that how the dimensions are weighted matters.
Contribution
The paper tests the criterion validity of LLM-based dialogue evaluation against verified conversion, shows how rubric design and weighting affect that validity, and proposes a new evaluation architecture for applied dialogue assessment.
Findings
Need Elicitation (D1) and Pacing Strategy (D3) are significantly associated with conversion after Bonferroni correction.
Equal-weighted composite scores underperform reweighted composite scores.
Sales behaviors by AI agents often fail to earn user trust, which affects conversion outcomes.
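The comparison between equal-weighted and reweighted composites can be illustrated with a minimal sketch. It is not the authors' pipeline: the helper names (`average_ranks`, `spearman_rho`, `composite`), the three-dimension toy scores, the conversion labels, and the reweighting vector are all hypothetical, chosen only to show the mechanics of scoring a composite against a binary outcome.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def composite(scores, weights):
    """Weighted mean of per-dimension scores for each conversation."""
    total = sum(weights)
    return [sum(w * s for w, s in zip(weights, row)) / total for row in scores]

# Hypothetical toy data: 6 conversations, 3 rubric dimensions, binary conversion.
toy_scores = [[4, 2, 3], [2, 5, 1], [5, 1, 4], [1, 4, 2], [3, 3, 5], [2, 2, 1]]
toy_converted = [1, 0, 1, 0, 1, 0]

equal_weights = [1, 1, 1]
hypothetical_reweights = [2, 0, 1]  # assumption: not the paper's weights

rho_equal = spearman_rho(composite(toy_scores, equal_weights), toy_converted)
rho_reweighted = spearman_rho(composite(toy_scores, hypothetical_reweights), toy_converted)
print(f"equal-weighted rho={rho_equal:.3f}, reweighted rho={rho_reweighted:.3f}")
```

With real data, the weights would need to be chosen on a separate sample from the one used to report the correlation, since fitting and evaluating weights on the same conversations would inflate the reweighted score.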
Abstract
Multi-dimensional rubric-based dialogue evaluation is widely used to assess conversational AI, yet its criterion validity -- whether quality scores are associated with the downstream outcomes they are meant to serve -- remains largely untested. We address this gap through a two-phase study on a major Chinese matchmaking platform, testing a 7-dimension evaluation rubric (implemented via LLM-as-Judge) against verified business conversion. Our findings concern rubric design and weighting, not LLM scoring accuracy: any judge using the same rubric would face the same structural issue. The core finding is dimension-level heterogeneity: in Phase 2 (n=60 human conversations, stratified sample, verified labels), Need Elicitation (D1: rho=0.368, p=0.004) and Pacing Strategy (D3: rho=0.354, p=0.006) are significantly associated with conversion after Bonferroni correction, while Contextual Memory…
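As a quick arithmetic check on the Bonferroni claim, the sketch below adjusts the two p-values quoted in the abstract. It assumes one test per rubric dimension, so seven comparisons, and a conventional significance level of 0.05; neither figure is stated in the visible text, and the paper may define its comparison family differently.

```python
# Assumption: one test per rubric dimension, so m = 7 comparisons.
NUM_DIMENSIONS = 7
# Assumption: conventional family-wise significance level.
ALPHA = 0.05

# p-values as reported in the abstract for the two dimensions it names.
reported_p = {"D1 Need Elicitation": 0.004, "D3 Pacing Strategy": 0.006}

def bonferroni_adjust(p, m):
    """Return the Bonferroni-adjusted p-value (p times m), capped at 1.0."""
    return min(1.0, p * m)

adjusted = {name: bonferroni_adjust(p, NUM_DIMENSIONS) for name, p in reported_p.items()}
significant = {name: adj < ALPHA for name, adj in adjusted.items()}

for name in reported_p:
    print(f"{name}: adjusted p={adjusted[name]:.3f}, significant={significant[name]}")
```

Under these assumptions the adjusted values are 0.028 and 0.042, both below 0.05, which is consistent with the abstract's statement that D1 and D3 survive correction.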
