Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models
Chenyang Lyu, Minghao Wu, Alham Fikri Aji

TL;DR
This paper critically examines the common probability-based evaluation methods for large language models, revealing their misalignment with actual generative performance and highlighting the need for more accurate assessment strategies.
Contribution
The study empirically demonstrates the limitations of probability-based evaluation methods for LLMs in MCQ tasks and emphasizes the importance of generation-based assessments.
Findings
Probability-based evaluations do not align well with generative predictions
Current frameworks mainly assess LLMs through output probabilities
Probability methods inadequately reflect true model capabilities
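The mismatch described in these findings can be illustrated with a minimal sketch. It assumes each question has four labelled options, that per-option log-probabilities are already available, and that the model has also produced a free-text answer. The log-probability values, the generated strings, and the regex-based answer parser are all hypothetical illustrations, not the paper's data or extraction procedure.

```python
import re
from typing import Dict, List, Optional


def predict_from_probs(option_logprobs: Dict[str, float]) -> str:
    """Probability-based prediction: pick the option label with the highest log-probability."""
    return max(option_logprobs, key=option_logprobs.get)


def predict_from_generation(text: str, labels: List[str]) -> Optional[str]:
    """Generation-based prediction: take the first standalone option label in the generated text.

    This parser is a simple heuristic assumed for illustration; it returns None
    when no label can be found.
    """
    pattern = r"\b(" + "|".join(re.escape(label) for label in labels) + r")\b"
    match = re.search(pattern, text)
    return match.group(1) if match else None


def agreement_rate(prob_preds: List[Optional[str]], gen_preds: List[Optional[str]]) -> float:
    """Fraction of questions where both prediction methods select the same option."""
    if not prob_preds:
        return 0.0
    matches = sum(p == g for p, g in zip(prob_preds, gen_preds))
    return matches / len(prob_preds)


LABELS = ["A", "B", "C", "D"]

# Made-up toy items: (per-option log-probabilities, generated answer text).
items = [
    ({"A": -0.2, "B": -1.9, "C": -2.5, "D": -3.0}, "The answer is A."),
    ({"A": -1.1, "B": -0.9, "C": -1.4, "D": -2.2}, "I think C is correct."),
    ({"A": -2.0, "B": -2.1, "C": -0.3, "D": -1.8}, "C"),
    ({"A": -0.5, "B": -1.5, "C": -1.6, "D": -2.4}, "Not sure."),
]

prob_preds = [predict_from_probs(logprobs) for logprobs, _ in items]
gen_preds = [predict_from_generation(text, LABELS) for _, text in items]

print("probability-based:", prob_preds)
print("generation-based: ", gen_preds)
print("agreement rate:   ", agreement_rate(prob_preds, gen_preds))
```

On this toy data the two methods disagree on the second item and the generated answer to the fourth item cannot be parsed at all, so a benchmark score computed from option probabilities alone would not reflect what the model actually outputs.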
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across various applications, fundamentally reshaping the landscape of natural language processing (NLP) research. However, recent evaluation frameworks often rely on the output probabilities of LLMs for predictions, primarily due to computational constraints, diverging from real-world LLM usage scenarios. While widely employed, the efficacy of these probability-based evaluation strategies remains an open research question. This study aims to scrutinize the validity of such probability-based evaluation methods within the context of using LLMs for Multiple Choice Questions (MCQs), highlighting their inherent limitations. Our empirical investigation reveals that the prevalent probability-based evaluation method inadequately aligns with generation-based prediction. Furthermore, current evaluation frameworks typically…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
