AutoMetrics: Approximate Human Judgements with Automatically Generated Evaluators
Michael J. Ryan, Yanzhe Zhang, Amol Salunkhe, Yi Chu, Di Xu, Diyi Yang

TL;DR
AutoMetrics is a framework that synthesizes automatic evaluation metrics for AI applications by combining metrics retrieved from a curated bank with LLM-as-a-Judge criteria generated from lightweight human feedback, improving correlation with human judgments across diverse tasks.
Contribution
AutoMetrics introduces a novel approach to approximate human evaluations using automatically generated evaluators and a curated metric bank, reducing reliance on extensive human feedback.
Findings
Improves Kendall correlation with human ratings by up to 33.4%.
Requires fewer than 100 feedback points for effective evaluation.
Improves correlation with human ratings across 5 diverse tasks.
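The findings above are stated in terms of Kendall correlation, which measures how often a metric orders pairs of outputs the same way human raters do. As a rough illustration only, the sketch below computes the tau-b variant (which handles ties) in plain Python. The function name `kendall_tau_b` and the sample scores are hypothetical and not taken from the paper, and the source does not say which Kendall variant it reports.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall tau-b rank correlation between two equal-length score lists."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from numerator and denominator
        elif dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1  # both lists order this pair the same way
        else:
            discordant += 1  # the lists disagree on this pair
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical data: human ratings and one automatic metric's scores.
human_ratings = [1, 2, 3, 4, 5]
metric_scores = [0.1, 0.4, 0.3, 0.8, 0.9]
print(kendall_tau_b(human_ratings, metric_scores))  # 0.8 (9 concordant pairs, 1 discordant)
```

A value of 1.0 means the metric ranks every pair as the human ratings do, and -1.0 means it reverses every pair.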
Abstract
Evaluating user-facing AI applications remains a central challenge, especially in open-ended domains such as travel planning, clinical note generation, or dialogue. The gold standard is user feedback (e.g., thumbs up/down) or behavioral signals (e.g., retention), but these are often scarce in prototypes and research projects, or too slow to use for system optimization. We present AutoMetrics, a framework for synthesizing evaluation metrics under low-data constraints. AutoMetrics combines retrieval from MetricBank, a collection of 48 metrics we curate, with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with human signal. AutoMetrics takes you from expensive measures to interpretable automatic metrics. Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human…
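The abstract describes composing several metrics via regression against human signal. The sketch below shows the general idea under stated assumptions: it uses ordinary least squares as a stand-in (the source says only "regression"), and the two feature columns, a retrieved metric score and a generated judge score, as well as the function names `fit_linear` and `predict`, are hypothetical. It is not the authors' implementation.

```python
def fit_linear(features, targets):
    """Ordinary least squares with an intercept, solved via the normal equations.

    features: list of rows, one score per candidate metric.
    targets:  list of human ratings, one per row.
    Returns [intercept, weight_1, ..., weight_k].
    """
    rows = [[1.0] + list(f) for f in features]  # prepend the intercept column
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("metric columns are collinear; cannot fit weights")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def predict(weights, feature_row):
    """Composite score for one output: intercept plus weighted metric scores."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], feature_row))


# Hypothetical scores: (retrieved_metric, generated_judge) per system output.
metric_rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
human = [1.0, 3.0, 4.0, 6.0, 8.0]  # hypothetical human ratings
weights = fit_linear(metric_rows, human)
composite = [predict(weights, row) for row in metric_rows]
```

The fitted weights stay inspectable, which is one way a regression-composed metric can remain interpretable: each weight shows how much a component metric contributes to the composite score.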
Taxonomy
Topics: Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education · Topic Modeling
