AutoMetrics: Approximate Human Judgements with Automatically Generated Evaluators

Michael J. Ryan; Yanzhe Zhang; Amol Salunkhe; Yi Chu; Di Xu; Diyi Yang

arXiv:2512.17267·cs.CL·December 22, 2025

Open Access

TL;DR

AutoMetrics is a framework that synthesizes automatic evaluation metrics for AI applications by combining metrics retrieved from a curated bank with LLM-as-a-Judge criteria generated from lightweight human feedback, improving correlation with human judgments across diverse tasks.

Contribution

AutoMetrics introduces a novel approach to approximate human evaluations using automatically generated evaluators and a curated metric bank, reducing reliance on extensive human feedback.

Findings

01. Improves Kendall correlation with human ratings by up to 33.4%.

02. Requires fewer than 100 feedback points for effective evaluation.

03. Evaluated on 5 diverse tasks, with improved agreement with human ratings.
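The findings are stated in terms of Kendall correlation between a metric and human ratings. As a minimal sketch of what that statistic measures, the following computes Kendall's tau-b in plain Python. The choice of the tau-b variant (which corrects for ties, common with thumbs up/down labels) and the name `kendall_tau_b` are assumptions made for illustration; the page does not say which variant the paper reports.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall tau-b between two equal-length score lists (O(n^2) over pairs)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        if dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1  # the two lists order this pair the same way
        else:
            discordant += 1  # the two lists order this pair in opposite ways
    denom = sqrt((concordant + discordant + ties_x) *
                 (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical data: five human ratings and one automatic metric's scores.
human = [1, 2, 3, 4, 5]
metric = [0.10, 0.40, 0.35, 0.80, 0.90]
print(kendall_tau_b(human, metric))  # 9 concordant, 1 discordant pair -> 0.8
```

A value of 1.0 means the metric ranks every pair of outputs the same way the human ratings do, 0.0 means no rank agreement, and -1.0 means the ranking is fully reversed.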

Abstract

Evaluating user-facing AI applications remains a central challenge, especially in open-ended domains such as travel planning, clinical note generation, or dialogue. The gold standard is user feedback (e.g., thumbs up/down) or behavioral signals (e.g., retention), but these are often scarce in prototypes and research projects, or too slow to use for system optimization. We present AutoMetrics, a framework for synthesizing evaluation metrics under low-data constraints. AutoMetrics combines retrieval from MetricBank, a collection of 48 metrics we curate, with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with human signal. AutoMetrics takes you from expensive measures to interpretable automatic metrics. Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human…
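The abstract says the candidate metrics are "composed via regression" against the human signal. A rough sketch of that idea, under stated assumptions, is below. Ordinary least squares with a tiny ridge term stands in for whatever regression the paper actually uses, which the truncated abstract does not specify. The function names `fit_linear_combination` and `predict`, and all of the data, are hypothetical.

```python
def fit_linear_combination(metric_scores, human_scores, ridge=1e-6):
    """Fit weights so that intercept + sum(w_i * metric_i) approximates human scores.

    metric_scores: one row of metric values per rated example.
    Returns [intercept, w_1, ..., w_k].
    Assumption: plain least squares; the small ridge term (also applied to the
    intercept) only keeps the linear system solvable when metrics are collinear.
    """
    rows = [[1.0] + list(r) for r in metric_scores]  # prepend intercept column
    k = len(rows[0])
    # Normal equations: (X^T X + ridge * I) w = X^T y
    a = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    b = [sum(r[i] * y for r, y in zip(rows, human_scores)) for i in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, k):
            factor = a[r][col] / a[col][col]
            for c in range(col, k):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution
    w = [0.0] * k
    for i in reversed(range(k)):
        w[i] = (b[i] - sum(a[i][j] * w[j] for j in range(i + 1, k))) / a[i][i]
    return w


def predict(weights, row):
    """Composite metric score for one example's metric values."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


# Hypothetical scores from two candidate metrics on five rated outputs.
metric_scores = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]]
human_scores = [1.0, 3.0, 0.0, 2.0, 1.8]
weights = fit_linear_combination(metric_scores, human_scores)
print([round(w, 3) for w in weights])
print(round(predict(weights, [0.8, 0.1]), 3))
```

Because the composite is a weighted sum of named metrics, the fitted weights stay readable, which fits the abstract's emphasis on interpretable metrics. In practice the fit would be checked on held-out ratings with a rank correlation rather than on the examples used for fitting.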

Taxonomy

Topics: Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education · Topic Modeling