No-Human in the Loop: Agentic Evaluation at Scale for Recommendation

Tao Zhang; Kehui Yao; Luyi Ma; Jiao Chen; Reza Yousefi Maragheh; Kai Zhao; Jianpeng Xu; Evren Korpeoglu; Sushant Kumar; Kannan Achan

arXiv:2511.03051 · cs.AI · November 6, 2025

TL;DR

ScalingEval is a benchmarking framework that evaluates large language models as judges for complementary-item recommendation. It uses consensus-driven majority voting to produce reproducible comparisons across models and product categories without human annotation.

Contribution

The paper introduces ScalingEval, a scalable, multi-agent evaluation protocol for systematically comparing LLMs as judges without human annotation.

Findings

01 Claude 3.5 Sonnet achieves the highest decision confidence

02 Gemini 1.5 Pro offers the best overall performance across categories

03 GPT-4o provides the most favorable latency-accuracy-cost tradeoff

Abstract

Evaluating large language models (LLMs) as judges is increasingly critical for building scalable and trustworthy evaluation pipelines. We present ScalingEval, a large-scale benchmarking study that systematically compares 36 LLMs, including GPT, Gemini, Claude, and Llama, across multiple product categories using a consensus-driven evaluation protocol. Our multi-agent framework aggregates pattern audits and issue codes into ground-truth labels via scalable majority voting, enabling reproducible comparison of LLM evaluators without human annotation. Applied to large-scale complementary-item recommendation, the benchmark reports four key findings: (i) Anthropic Claude 3.5 Sonnet achieves the highest decision confidence; (ii) Gemini 1.5 Pro offers the best overall performance across categories; (iii) GPT-4o provides the most favorable latency-accuracy-cost tradeoff; and (iv) GPT-OSS 20B…
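The consensus step described in the abstract, where judge outputs are aggregated into ground-truth labels by majority voting, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`consensus_label`, `agreement_with_consensus`), the label strings, the judge names, and the strict-majority rule are all assumptions made for illustration, and the pattern audits and issue codes are reduced here to a single label per judge.

```python
from collections import Counter
from typing import Dict, List, Optional


def consensus_label(votes: Dict[str, str]) -> Optional[str]:
    """Return the label chosen by a strict majority of judges, else None.

    `votes` maps a (hypothetical) judge name to the label it assigned.
    Requiring a strict majority is an assumption, not the paper's stated rule.
    """
    if not votes:
        return None
    label, top = Counter(votes.values()).most_common(1)[0]
    return label if top * 2 > len(votes) else None


def agreement_with_consensus(items: List[Dict[str, str]]) -> Dict[str, float]:
    """Fraction of items on which each judge matches the consensus label.

    Items without a strict majority are skipped, since they have no
    consensus label to compare against.
    """
    hits: Counter = Counter()
    seen: Counter = Counter()
    for votes in items:
        label = consensus_label(votes)
        if label is None:
            continue
        for judge, vote in votes.items():
            seen[judge] += 1
            hits[judge] += int(vote == label)
    return {judge: hits[judge] / seen[judge] for judge in seen}


if __name__ == "__main__":
    # Toy votes from three hypothetical judges on three item pairs.
    items = [
        {"judge_a": "complementary", "judge_b": "complementary", "judge_c": "not_complementary"},
        {"judge_a": "not_complementary", "judge_b": "not_complementary", "judge_c": "not_complementary"},
        {"judge_a": "complementary", "judge_b": "not_complementary"},  # tie, skipped
    ]
    print(agreement_with_consensus(items))
```

Scoring each judge against the consensus label is one simple way to compare evaluators when no human labels exist. It rewards agreement with the panel rather than correctness in an absolute sense, so the result depends on which judges are included.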

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Recommender Systems and Techniques · Topic Modeling