LLM-as-a-Judge for Scalable Test Coverage Evaluation: Accuracy, Operational Reliability, and Cost
Donghao Huang, Shila Chew, Anna Dutkiewicz, Zhaoxia Wang

TL;DR
This paper introduces LLM-as-a-Judge (LAJ), a framework that uses large language models to evaluate software test coverage. It analyzes accuracy, reliability, and cost across 20 model configurations and shows that smaller models can outperform larger ones on both accuracy and cost.
Contribution
The paper analyzes LLMs as automated judges for test coverage evaluation, introduces a new reliability metric, the Evaluation Completion Rate (ECR@1), and shows that smaller models can be both cheaper and more accurate than larger ones.
Findings
GPT-4o Mini attains the best accuracy (6.07 MAAE) at $1.01 per 1K evaluations, outperforming larger models on both accuracy and cost.
First-attempt reliability (ECR@1) ranges from 85.4% to 100.0%, depending on model and configuration.
Cost per 1K evaluations ranges from $0.45 to $78.96.
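To illustrate how first-attempt reliability can feed into cost, here is a minimal sketch. It assumes that every failed evaluation is retried until it succeeds and that each attempt succeeds independently with probability equal to ECR@1, so the expected number of attempts is 1 / ECR@1. This retry model, the function names, and the base cost in the usage line are illustrative assumptions, not the paper's cost accounting.

```python
def expected_attempts(ecr_at_1: float) -> float:
    """Expected attempts per evaluation under a simple geometric retry model.

    Assumption (not from the paper): attempts are independent and each one
    succeeds with probability ecr_at_1.
    """
    if not 0.0 < ecr_at_1 <= 1.0:
        raise ValueError("ecr_at_1 must be in (0, 1]")
    return 1.0 / ecr_at_1


def effective_cost_per_1k(base_cost_per_1k: float, ecr_at_1: float) -> float:
    """Scale a single-attempt cost per 1K evaluations by the expected retries."""
    return base_cost_per_1k * expected_attempts(ecr_at_1)


# Hypothetical base cost of $10.00 per 1K at the lowest reported reliability.
overhead = effective_cost_per_1k(10.0, 0.854) - 10.0
print(f"Retry overhead per 1K evaluations: ${overhead:.2f}")
```

Under this model a configuration at 100% ECR@1 incurs no retry overhead, while one at 85.4% pays roughly 17% more per completed evaluation.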
Abstract
Assessing software test coverage at scale remains a bottleneck in QA pipelines. We present LLM-as-a-Judge (LAJ), a production-ready, rubric-driven framework for evaluating Gherkin acceptance tests with structured JSON outputs. Across 20 model configurations (GPT-4, GPT-5 with varying reasoning effort, and open-weight models) on 100 expert-annotated scripts over 5 runs (500 evaluations), we provide the first comprehensive analysis spanning accuracy, operational reliability, and cost. We introduce the Evaluation Completion Rate (ECR@1) to quantify first-attempt success, revealing reliability from 85.4% to 100.0% with material cost implications via retries. Results show that smaller models can outperform larger ones: GPT-4o Mini attains the best accuracy (6.07 MAAE), high reliability (96.6% ECR@1), and low cost ($1.01 per 1K), yielding a 78x cost reduction vs. GPT-5 (high reasoning) while…
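The abstract describes ECR@1 as a measure of first-attempt success on structured JSON outputs. A rough sketch of such a metric follows. It assumes that an attempt counts as complete when the raw model output parses as a JSON object containing a set of required fields. The field names `score` and `rationale` and both function names are hypothetical, and the paper's exact completion criteria may differ.

```python
import json
from typing import Optional, Sequence

# Hypothetical required fields; the paper's actual JSON schema is not given here.
REQUIRED_FIELDS = ("score", "rationale")


def is_complete(raw_output: Optional[str]) -> bool:
    """Return True if the output parses as a JSON object with all required fields."""
    try:
        parsed = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        # Malformed JSON, or no output at all (None), counts as a failed attempt.
        return False
    return isinstance(parsed, dict) and all(f in parsed for f in REQUIRED_FIELDS)


def ecr_at_1(first_attempt_outputs: Sequence[Optional[str]]) -> float:
    """Fraction of evaluations whose first attempt produced a complete result."""
    if not first_attempt_outputs:
        raise ValueError("need at least one evaluation")
    completed = sum(is_complete(raw) for raw in first_attempt_outputs)
    return completed / len(first_attempt_outputs)


outputs = [
    '{"score": 80, "rationale": "covers the main scenarios"}',
    "not json",
    '{"score": 5}',
    None,
]
print(f"ECR@1 = {ecr_at_1(outputs):.2%}")
```

Only the first attempt for each evaluation is passed in, so later retries do not affect the metric.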
Taxonomy
Topics: Software Testing and Debugging Techniques · Software Engineering Research · Software System Performance and Reliability
