Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments

Roland Daynauth; Jason Mars

arXiv:2407.12847·cs.CL·July 19, 2024

Open Access

TL;DR

This paper addresses token count bias in language model evaluations by developing a recalibration method for LLM evaluators, significantly improving their alignment with human preferences across multiple use cases.

Contribution

It introduces a Bayesian-based recalibration procedure to mitigate token count bias in LLM evaluators, enhancing their alignment with human judgments.

Findings

01. Spearman's rank correlation with human evaluations improved from -27.27 to 44.55 in the Recommendation use case.

02. Recalibration significantly reduces token count bias.

03. Automated evaluators become more reliable and human-aligned.
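To make the first finding concrete, here is a minimal sketch of how a Spearman rank correlation between a human ranking and an evaluator ranking could be computed. It assumes the reported scores are the coefficient multiplied by 100, which the page does not state explicitly. The rankings below are made-up illustrative values, not data from the paper, and the helper names `average_ranks` and `spearman` are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Assign ranks 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Illustrative rankings of five models (assumed values, not from the paper).
human_rank = [1, 2, 3, 4, 5]
evaluator_rank = [2, 1, 3, 5, 4]

# Scaled by 100 under the assumption described above.
score = 100 * spearman(human_rank, evaluator_rank)
print(round(score, 2))
```

A negative score, such as the pre-recalibration value reported above, means the evaluator tends to order the candidates in the opposite direction from the human raters.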

Abstract

The SLAM paper demonstrated that on-device Small Language Models (SLMs) are a viable and cost-effective alternative to API-based Large Language Models (LLMs), such as OpenAI's GPT-4, offering comparable performance and stability. However, SLAM also identified discrepancies between human preferences and traditional auto-evaluators. This follow-up paper explores methods to align LLM evaluator preferences with human evaluations by addressing biases, particularly the bias toward higher token counts. We employed Bayesian statistics and a t-test to quantify this bias and developed a recalibration procedure to adjust the GPTScorer. The recalibrated LLM evaluator aligns significantly better with human evaluations across multiple use cases. For instance, the Spearman's rank correlation score in the Recommendation use case improved from -27.27 to 44.55. These results highlight the…
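The abstract does not spell out how the bias was quantified, so the following is only a rough sketch of one way to do it, not the authors' procedure. It assumes each evaluator judgment is recorded as a pair (token count of the preferred response, token count of the rejected response). It then computes a Beta-Binomial posterior mean for the probability that the longer response wins, plus a paired t statistic on the token count differences. The data, the uniform Beta(1, 1) prior, and the function names are all hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def longer_wins_posterior(pairs, prior_a=1.0, prior_b=1.0):
    """Posterior mean of P(evaluator prefers the longer response).

    Uses a Beta(prior_a, prior_b) prior; pairs with equal token counts are ignored.
    """
    longer = sum(1 for win, lose in pairs if win > lose)
    shorter = sum(1 for win, lose in pairs if win < lose)
    return (prior_a + longer) / (prior_a + prior_b + longer + shorter)


def paired_t_statistic(pairs):
    """t statistic for H0: mean(preferred tokens - rejected tokens) == 0."""
    diffs = [win - lose for win, lose in pairs]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical judgments: (tokens in preferred response, tokens in rejected response).
pairs = [(120, 80), (95, 60), (150, 90), (70, 85), (130, 100), (110, 75)]

posterior = longer_wins_posterior(pairs)
t_stat = paired_t_statistic(pairs)
print(posterior, round(t_stat, 2))
```

A posterior mean well above 0.5 together with a large positive t statistic would suggest the evaluator favors longer responses. Turning the t statistic into a p-value requires the Student's t distribution with `len(pairs) - 1` degrees of freedom, which is left out here to keep the sketch dependency-free.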

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Multi-Head Attention · Dense Connections