On the Worst Prompt Performance of Large Language Models
Bowen Cao, Deng Cai, Zhisong Zhang, Yuexian Zou, Wai Lam

TL;DR
This paper introduces RobustAlpacaEval, a benchmark to evaluate the worst-case performance of large language models across diverse prompts, revealing significant variability and challenges in prompt robustness.
Contribution
The paper presents a new benchmark and analysis framework focusing on worst prompt performance, highlighting the variability and difficulty in improving LLM robustness.
Findings
Substantial performance variability across prompts, e.g., a 45.48% gap between best and worst prompt performance for Llama-2-70B-chat.
Difficulty in identifying the worst prompt using existing methods.
Limited impact of current prompt engineering techniques on worst prompt performance.
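To make the idea of worst prompt performance concrete, here is a minimal Python sketch of how such a summary could be computed. It assumes each query comes with several semantically equivalent paraphrases, each already scored on a 0-100 scale. The function name, the input layout, and the scores are all hypothetical, and the paper's exact metric definitions may differ.

```python
from statistics import mean


def prompt_robustness_summary(scores_per_case):
    """Summarize performance across paraphrases of each query.

    scores_per_case: list of lists; each inner list holds the scores a
    model obtained on the paraphrases of one query (hypothetical layout).
    """
    if not scores_per_case or any(len(s) == 0 for s in scores_per_case):
        raise ValueError("every case needs at least one paraphrase score")

    # Worst-case: take the lowest-scoring paraphrase of each query,
    # then average over queries. This acts as a lower bound.
    worst = mean(min(s) for s in scores_per_case)
    # Best-case: the highest-scoring paraphrase of each query.
    best = mean(max(s) for s in scores_per_case)
    # Average-case: mean over paraphrases, then mean over queries.
    average = mean(mean(s) for s in scores_per_case)

    return {"worst": worst, "best": best, "average": average, "gap": best - worst}


# Made-up scores for two queries with three paraphrases each.
scores = [[20, 80, 50], [60, 90, 30]]
summary = prompt_robustness_summary(scores)
print(summary)
```

In this sketch a large `gap` means the model's output quality depends heavily on phrasing, even though every paraphrase asks for the same thing.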
Abstract
The performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts, which raises significant concerns about their reliability in real-world scenarios. Existing studies often divide prompts into task-level instructions and case-level inputs and primarily focus on evaluating and improving robustness against variations in task-level instructions. However, this setup fails to fully address the diversity of real-world user queries and assumes the existence of task-specific datasets. To address these limitations, we introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries and emphasizes the importance of using the worst prompt performance to gauge the lower bound of model performance. Extensive experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms
Methods: Focus
