A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations

Md Tahmid Rahman Laskar; Sawsan Alqahtani; M Saiful Bari; Mizanur Rahman; Mohammad Abdullah Matin Khan; Haidar Khan; Israt Jahan; Amran Bhuiyan; Chee Wei Tan; Md Rizwan Parvez; Enamul Hoque; Shafiq Joty; Jimmy Huang

arXiv:2407.04069 · cs.CL · October 4, 2024 · 2 cites

TL;DR

This paper systematically reviews the challenges and limitations in evaluating Large Language Models, highlighting inconsistencies and proposing recommendations for more reliable and reproducible assessments.

Contribution

It provides a comprehensive analysis of evaluation challenges and offers guidelines to improve the reliability and consistency of LLM assessments.

Findings

01

Identification of key challenges in LLM evaluation

02

Analysis of factors causing evaluation inconsistencies

03

Recommendations for standardized evaluation practices

Abstract

Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities in performing diverse tasks across various domains. However, a thorough evaluation of these models is crucial before deploying them in real-world applications to ensure they produce reliable performance. Despite the well-established importance of evaluating LLMs in the community, the complexity of the evaluation process has led to varied evaluation setups, causing inconsistencies in findings and interpretations. To address this, we systematically review the primary challenges and limitations causing these inconsistencies and unreliable evaluations in various steps of LLM evaluation. Based on our critical review, we present our perspectives and recommendations to ensure LLM evaluations are reproducible, reliable, and robust.

Code & Models

Repositories

ntunlp/critical-review-of-llm-eval
Official

Videos

A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations

Taxonomy

Topics: Topic Modeling

Methods: Softmax · Attention Is All You Need