Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models

Chenyang Lyu; Minghao Wu; Alham Fikri Aji

arXiv:2402.13887 · cs.CL · July 10, 2024 · 1 citation

TL;DR

This paper critically examines the common probability-based evaluation methods for large language models, revealing their misalignment with actual generative performance and highlighting the need for more accurate assessment strategies.

Contribution

The study empirically demonstrates the limitations of probability-based evaluation methods for LLMs in MCQ tasks and emphasizes the importance of generation-based assessments.

Findings

01. Probability-based evaluations do not align well with generative predictions

02. Current frameworks mainly assess LLMs through output probabilities

03. Probability methods inadequately reflect true model capabilities

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across various applications, fundamentally reshaping the landscape of natural language processing (NLP) research. However, recent evaluation frameworks often rely on the output probabilities of LLMs for predictions, primarily due to computational constraints, diverging from real-world LLM usage scenarios. While widely employed, the efficacy of these probability-based evaluation strategies remains an open research question. This study aims to scrutinize the validity of such probability-based evaluation methods within the context of using LLMs for Multiple Choice Questions (MCQs), highlighting their inherent limitations. Our empirical investigation reveals that the prevalent probability-based evaluation method inadequately aligns with generation-based prediction. Furthermore, current evaluation frameworks typically…
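
To make the contrast in the abstract concrete, the following is a minimal sketch of how the two prediction styles for an MCQ could be compared. It is not the authors' implementation: the function names, the regex-based answer extraction, and the toy log-probabilities and generations are all hypothetical and chosen only for illustration. A real setup would obtain the option scores and the free-text output from an actual model.

```python
import re
from typing import Dict, List, Optional


def prob_prediction(option_logprobs: Dict[str, float]) -> str:
    """Probability-based prediction: pick the option label with the highest score."""
    return max(option_logprobs, key=option_logprobs.get)


def gen_prediction(generated_text: str, options: List[str]) -> Optional[str]:
    """Generation-based prediction: take the first standalone option label
    found in the model's free-text answer (a simplistic, assumed heuristic)."""
    pattern = r"\b(" + "|".join(re.escape(o) for o in options) + r")\b"
    match = re.search(pattern, generated_text)
    return match.group(1) if match else None


def agreement_rate(prob_preds: List[str], gen_preds: List[Optional[str]]) -> float:
    """Fraction of questions where both prediction styles choose the same option."""
    if not prob_preds:
        return 0.0
    matches = sum(p == g for p, g in zip(prob_preds, gen_preds))
    return matches / len(prob_preds)


# Hypothetical toy data: per-option log-probabilities and generated answers.
options = ["A", "B", "C", "D"]
examples = [
    {"logprobs": {"A": -0.4, "B": -1.9, "C": -2.3, "D": -2.8},
     "generation": "The answer is B because ..."},
    {"logprobs": {"A": -2.1, "B": -1.7, "C": -0.6, "D": -2.5},
     "generation": "C"},
    {"logprobs": {"A": -1.8, "B": -2.0, "C": -1.9, "D": -0.9},
     "generation": "I am not sure."},
]

prob_preds = [prob_prediction(ex["logprobs"]) for ex in examples]
gen_preds = [gen_prediction(ex["generation"], options) for ex in examples]
rate = agreement_rate(prob_preds, gen_preds)
```

In this toy data the two styles disagree on the first question and the third generation contains no option label at all, so they agree on only one of three questions. A low agreement rate of this kind is the sort of misalignment the abstract describes, though the actual measurements are those reported in the paper.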

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques