E-Bench: Towards Evaluating the Ease-of-Use of Large Language Models

Zhenyu Zhang; Bingguang Hao; Jinpeng Li; Zekai Zhang; Dongyan Zhao

arXiv:2406.10950·cs.CL·June 18, 2024

TL;DR

This paper introduces E-Bench, a benchmark for evaluating the robustness and ease-of-use of large language models against prompt perturbations like paraphrasing, simplification, colloquialism, and typos, revealing that larger models are more robust but still not user-friendly enough.

Contribution

The paper presents E-Bench, a systematic benchmark for assessing LLMs' stability to prompt perturbations, addressing a gap in evaluating model robustness in real-world scenarios.

Findings

01. Larger models show improved robustness to prompt perturbations

02. Prompt perturbations significantly degrade LLM performance

03. There is still a considerable gap in making LLMs user-friendly

Abstract

Most large language models (LLMs) are sensitive to prompts: a synonymous rephrasing or a typo may lead to unexpected results. Composing an optimal prompt for a specific demand lacks theoretical support and relies entirely on human experimentation, which poses a considerable obstacle to popularizing generative artificial intelligence. However, there is no systematic analysis of the stability of LLMs in resisting prompt perturbations in real-world scenarios. In this work, we propose to evaluate the ease-of-use of LLMs and construct E-Bench, which simulates real human usage through synonymous perturbation (including paraphrasing, simplification, and colloquialism) and typographical perturbation (such as typing errors). On this basis, we also discuss the combination of these two types of perturbation and analyze the main reasons for performance degradation.…
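To make the idea of a typographical perturbation concrete, here is a minimal Python sketch. It is not the procedure used to build E-Bench, which this page does not describe; the adjacent-letter swap, the function names `typo_perturb` and `relative_drop`, and the `rate` and `seed` parameters are all assumptions chosen for illustration.

```python
import random


def typo_perturb(prompt: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap two adjacent interior letters in a random subset of words.

    Hypothetical typo model for illustration only; not the E-Bench procedure.
    """
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    out = []
    for word in prompt.split(" "):
        # Only touch purely alphabetic words with at least two interior letters.
        if len(word) >= 4 and word.isalpha() and rng.random() < rate:
            # Pick an interior position so the first and last letters stay put.
            i = rng.randrange(1, len(word) - 2)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        out.append(word)
    return " ".join(out)


def relative_drop(clean_score: float, perturbed_score: float) -> float:
    """Fraction of the clean score that is lost under perturbation."""
    if clean_score == 0:
        return 0.0
    return (clean_score - perturbed_score) / clean_score
```

With `rate=1.0`, the four-letter word `this` always becomes `tihs`, since only one interior swap is possible. A robustness check under these assumptions would score a model on the clean prompts and on their perturbed copies, then compare the two scores with `relative_drop`.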

Taxonomy

Topics: Recommender Systems and Techniques · Data Quality and Management