Putting HUMANS first: Efficient LAM Evaluation with Human Preference Alignment

Woody Haosheng Gan; William Held; Diyi Yang

arXiv:2605.00022·cs.CL·May 4, 2026


TL;DR

This paper introduces HUMANS, a benchmark built from small, curated data subsets for efficiently evaluating large audio models (LAMs). Subsets of about 50 examples closely track full benchmark scores, and regression models trained on them predict human preferences better than random subsets or the full benchmark.

Contribution

It proposes a novel subset selection approach that reduces evaluation costs while maintaining high correlation with full benchmarks and human preferences.
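
To make the "correlation with full benchmarks" criterion concrete, here is a minimal sketch of how a subset can be scored against the full benchmark. It is not the paper's selection method: the subset is drawn uniformly at random (the baseline the findings compare against), the score matrix is synthetic, and the names `make_synthetic_scores` and `subset_full_correlation` are hypothetical helpers introduced for illustration.

```python
import numpy as np


def subset_full_correlation(scores: np.ndarray, subset_idx: np.ndarray) -> float:
    """Pearson r between per-model mean scores on a subset and on all examples.

    scores has shape (n_models, n_examples); each entry is a per-example score.
    """
    full = scores.mean(axis=1)                # full-benchmark score per model
    sub = scores[:, subset_idx].mean(axis=1)  # subset score per model
    return float(np.corrcoef(sub, full)[0, 1])


def make_synthetic_scores(n_models: int = 18, n_examples: int = 2000, seed: int = 0) -> np.ndarray:
    """Synthetic 0/1 correctness matrix (an assumption, not the paper's data)."""
    rng = np.random.default_rng(seed)
    ability = rng.uniform(0.2, 0.9, size=n_models)           # latent model quality
    difficulty = rng.uniform(-0.2, 0.2, size=n_examples)     # latent example difficulty
    p = np.clip(ability[:, None] - difficulty[None, :], 0.01, 0.99)
    return (rng.random((n_models, n_examples)) < p).astype(float)


scores = make_synthetic_scores()
rng = np.random.default_rng(1)
# Random 50-example subset: a baseline, not a curated selection.
subset_idx = rng.choice(scores.shape[1], size=50, replace=False)
r = subset_full_correlation(scores, subset_idx)
print(f"Pearson r between 50-example subset and full scores: {r:.3f}")
```

A curated selection method would replace the `rng.choice` line with its own choice of indices; the correlation check itself stays the same.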

Findings

01. 50-example subsets achieve over 0.93 correlation with full benchmark scores.

02. Regression models trained on curated subsets reach 0.98 correlation with human preferences.

03. HUMANS benchmark outperforms random subsets and full benchmarks in preference prediction.

Abstract

The rapid proliferation of large audio models (LAMs) demands efficient approaches for model comparison, yet comprehensive benchmarks are costly. To fill this gap, we investigate whether minimal subsets can reliably evaluate LAMs while reducing costs and data redundancy. Analyzing 10 subset selection methods with 18 audio models across 40 tasks covering major LAM evaluation dimensions, we show that subsets of just 50 examples (0.3% of data) can achieve over 0.93 Pearson correlation with full benchmark scores. To understand how well these scores align with what practitioners ultimately care about, user satisfaction, we collect 776 human preference ratings from realistic voice assistant conversations, finding that both the subsets and the full benchmark achieve only 0.85 correlation with human preferences. To better predict preferences, we train regression models on these selected subsets, achieving 0.98…
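
The regression step can be sketched as follows. The text shown here does not say which regression model the authors use, so this example assumes ridge regression in closed form, evaluated with leave-one-model-out prediction. The data are synthetic, and `ridge_fit`, `leave_one_out_predictions`, and `human_pref` are hypothetical names introduced for illustration.

```python
import numpy as np


def ridge_fit(X: np.ndarray, y: np.ndarray, alpha: float = 1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w


def leave_one_out_predictions(X: np.ndarray, y: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Predict each model's preference score from a fit on all other models."""
    preds = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        w, b = ridge_fit(X[keep], y[keep], alpha)
        preds[i] = X[i] @ w + b
    return preds


# Synthetic stand-in: 18 models scored on a 50-example subset (assumed shapes).
rng = np.random.default_rng(0)
n_models, n_items = 18, 50
ability = rng.uniform(0.2, 0.9, size=n_models)
X = (rng.random((n_models, n_items)) < ability[:, None]).astype(float)
# Hypothetical human preference score per model: latent ability plus noise.
human_pref = ability + rng.normal(0.0, 0.05, size=n_models)

preds = leave_one_out_predictions(X, human_pref, alpha=10.0)
r = float(np.corrcoef(preds, human_pref)[0, 1])
print(f"Held-out Pearson r with synthetic preferences: {r:.3f}")
```

With more features (50 examples) than models (18), some regularization is needed for the fit to be well-posed, which is why the sketch uses ridge rather than ordinary least squares.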
