What is the best model? Application-driven Evaluation for Large Language Models

Shiguo Lian; Kaikai Zhao; Xinhui Liu; Xuejiao Lei; Bikun Yang; Wenjing Zhang; Kai Wang; Zhaoxiang Liu

arXiv:2406.10307 · cs.CL · June 18, 2024 · 1 cites

TL;DR

This paper introduces A-Eval, a comprehensive benchmark for evaluating large language models based on practical application tasks, providing guidance for selecting the most suitable model while considering cost and performance.

Contribution

The paper presents a new application-driven evaluation benchmark with a categorized dataset and an effective assessment method for large language models.

Findings

01. Model scale correlates with task performance.

02. Task difficulty impacts model effectiveness.

03. Guidelines for optimal model selection are proposed.

Abstract

General large language models enhanced with supervised fine-tuning and reinforcement learning from human feedback are increasingly popular in academia and industry because they generalize foundation models to a wide range of practical tasks through prompting. To help users select the best model for a practical application scenario, i.e., the model that meets the application requirements at minimum cost, we introduce A-Eval, an application-driven evaluation benchmark for general large language models. First, we categorize evaluation tasks into five main categories and 27 sub-categories from a practical application perspective. Next, we construct a dataset of 678 question-and-answer pairs through a process of collecting, annotating, and reviewing. Then, we design an objective and effective evaluation method and evaluate a series of LLMs of different scales on A-Eval.…
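The selection criterion stated in the abstract (meet the application requirements while minimizing cost) can be illustrated with a minimal sketch. This is not the paper's method: the `Candidate` type, the `select_model` helper, the model names, and all scores and costs below are hypothetical placeholders chosen only to show the idea of picking the cheapest model that clears a required score.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """A hypothetical candidate model (not taken from the paper)."""
    name: str
    score: float  # assumed benchmark score on the target task category, in [0, 1]
    cost: float   # assumed relative deployment cost (higher is more expensive)


def select_model(candidates: List[Candidate], required_score: float) -> Optional[Candidate]:
    """Return the cheapest candidate whose score meets the requirement.

    Ties on cost are broken in favor of the higher score.
    Returns None when no candidate meets the requirement.
    """
    eligible = [c for c in candidates if c.score >= required_score]
    if not eligible:
        return None
    return min(eligible, key=lambda c: (c.cost, -c.score))


if __name__ == "__main__":
    # Illustrative numbers only; they are not results from A-Eval.
    candidates = [
        Candidate("model-small", score=0.62, cost=1.0),
        Candidate("model-medium", score=0.81, cost=4.0),
        Candidate("model-large", score=0.90, cost=12.0),
    ]
    chosen = select_model(candidates, required_score=0.80)
    print(chosen.name if chosen else "no model meets the requirement")
```

In this sketch the requirement acts as a hard threshold and cost is minimized among the models that pass it, so a larger model is selected only when no cheaper one is good enough.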

Taxonomy

Topics: Topic Modeling