On the Worst Prompt Performance of Large Language Models

Bowen Cao; Deng Cai; Zhisong Zhang; Yuexian Zou; Wai Lam

arXiv:2406.10248·cs.CL·October 31, 2024·3 cites

TL;DR

This paper introduces RobustAlpacaEval, a benchmark to evaluate the worst-case performance of large language models across diverse prompts, revealing significant variability and challenges in prompt robustness.

Contribution

The paper presents a new benchmark and analysis framework centered on worst prompt performance, highlighting how much model performance varies across prompts and how difficult it is to improve LLM robustness.

Findings

1. Substantial performance variability across prompts: for example, a 45.48% gap between the best and worst prompt performance of Llama-2-70B-chat.

2. Difficulty in identifying the worst prompt using existing methods.

3. Limited impact of current prompt engineering techniques on worst prompt performance.

Abstract

The performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts, which raises significant concerns about their reliability in real-world scenarios. Existing studies often divide prompts into task-level instructions and case-level inputs and primarily focus on evaluating and improving robustness against variations in task-level instructions. However, this setup fails to fully address the diversity of real-world user queries and assumes the existence of task-specific datasets. To address these limitations, we introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries and emphasizes the importance of using the worst prompt performance to gauge the lower bound of model performance. Extensive experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover…
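
The idea of using worst prompt performance as a lower bound can be illustrated with a minimal sketch. It assumes that each case has several semantically equivalent paraphrases, that each paraphrase receives a numeric score, and that the worst (or best) performance is the per-case minimum (or maximum) averaged over cases. The function name `prompt_robustness`, the aggregation scheme, and the toy scores are hypothetical and are not taken from the paper or its code.

```python
from statistics import mean


def prompt_robustness(scores):
    """Summarize performance across paraphrases of each case.

    scores[i][j] is the (hypothetical) score of paraphrase j of case i.
    Assumption: worst/best are per-case min/max, averaged over all cases.
    """
    worst = mean(min(case) for case in scores)
    best = mean(max(case) for case in scores)
    average = mean(mean(case) for case in scores)
    return {
        "worst": worst,      # lower bound over paraphrases
        "best": best,        # upper bound over paraphrases
        "average": average,  # mean over paraphrases, then over cases
        "gap": best - worst, # spread between best and worst prompt
    }


if __name__ == "__main__":
    # Toy data: two cases, three paraphrases each (made-up scores).
    toy_scores = [
        [1.0, 0.0, 0.5],
        [0.5, 0.5, 1.0],
    ]
    print(prompt_robustness(toy_scores))
```

Under this reading, a large `gap` means a model's measured quality depends heavily on which phrasing of a query it happens to receive, and the `worst` value is the figure a user cannot rule out in practice.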

Code & Models

Videos

On the Worst Prompt Performance of Large Language Models · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms