LLM-as-a-Judge for Scalable Test Coverage Evaluation: Accuracy, Operational Reliability, and Cost

Donghao Huang; Shila Chew; Anna Dutkiewicz; Zhaoxia Wang

arXiv:2512.01232 · cs.SE · December 2, 2025

TL;DR

This paper presents LLM-as-a-Judge (LAJ), a rubric-driven framework that uses large language models to evaluate software test coverage. It analyzes accuracy, operational reliability, and cost across 20 model configurations and shows that a smaller model can beat larger ones on both accuracy and cost.

Contribution

The paper analyzes LLMs as automated judges for test coverage evaluation along three axes: accuracy, operational reliability, and cost. It introduces the Evaluation Completion Rate (ECR@1), a metric for first-attempt success, and shows that smaller models can be both cheaper and more accurate than larger ones.

Findings

01

Smaller models can outperform larger ones: GPT-4o Mini attains the best accuracy (6.07 MAAE) at $1.01 per 1K evaluations, a 78x cost reduction vs. GPT-5 (high reasoning).

02

First-attempt reliability (ECR@1) ranges from 85.4% to 100.0%, depending on model and configuration, and affects cost through retries.

03

Cost per 1K evaluations ranges from $0.45 to $78.96.

Abstract

Assessing software test coverage at scale remains a bottleneck in QA pipelines. We present LLM-as-a-Judge (LAJ), a production-ready, rubric-driven framework for evaluating Gherkin acceptance tests with structured JSON outputs. Across 20 model configurations (GPT-4, GPT-5 with varying reasoning effort, and open-weight models) on 100 expert-annotated scripts over 5 runs (500 evaluations), we provide the first comprehensive analysis spanning accuracy, operational reliability, and cost. We introduce the Evaluation Completion Rate (ECR@1) to quantify first-attempt success, revealing reliability from 85.4% to 100.0% with material cost implications via retries. Results show that smaller models can outperform larger ones: GPT-4o Mini attains the best accuracy (6.07 MAAE), high reliability (96.6% ECR@1), and low cost ($1.01 per 1K), yielding a 78x cost reduction vs. GPT-5 (high reasoning) while…
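The abstract describes ECR@1 as a measure of first-attempt success over structured JSON outputs but, in this excerpt, does not give its formula. The sketch below shows one plausible reading: the fraction of evaluations whose first response parses as JSON and contains the required fields. The `score` field, the function names, and the retry-cost model (independent retries, so expected attempts equal 1/ECR@1) are assumptions made for illustration, not the paper's definitions.

```python
import json
from typing import Iterable

# Hypothetical required field; the paper's actual rubric schema is not shown here.
REQUIRED_KEYS = {"score"}


def is_valid_first_attempt(raw: str) -> bool:
    """Return True if the raw model output is a JSON object with the required keys."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_KEYS <= parsed.keys()


def ecr_at_1(first_attempt_outputs: Iterable[str]) -> float:
    """Fraction of evaluations that complete on the first attempt (assumed definition)."""
    outputs = list(first_attempt_outputs)
    if not outputs:
        raise ValueError("at least one evaluation is required")
    completed = sum(is_valid_first_attempt(o) for o in outputs)
    return completed / len(outputs)


def expected_cost_per_1k(single_attempt_cost_per_1k: float, ecr: float) -> float:
    """Retry-inflated cost, assuming each retry succeeds independently with probability ecr."""
    if not 0.0 < ecr <= 1.0:
        raise ValueError("ecr must be in (0, 1]")
    # Expected number of attempts for a geometric distribution is 1 / ecr.
    return single_attempt_cost_per_1k / ecr


if __name__ == "__main__":
    # Made-up outputs, not data from the paper.
    sample = ['{"score": 80}', "not json", '{"score": 55}', '{"verdict": "ok"}']
    rate = ecr_at_1(sample)
    print(f"ECR@1 = {rate:.2f}")
    print(f"Expected cost per 1K = ${expected_cost_per_1k(1.00, rate):.2f}")
```

Under this model, a configuration at the low end of the reported range (85.4% ECR@1) would need roughly 1/0.854 ≈ 1.17 attempts per evaluation on average. Whether the per-1K costs quoted in the abstract already include retries is not stated in this excerpt.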

Taxonomy

Topics: Software Testing and Debugging Techniques · Software Engineering Research · Software System Performance and Reliability