Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

Melanie Sclar; Yejin Choi; Yulia Tsvetkov; Alane Suhr

arXiv:2310.11324 · cs.CL · July 3, 2024 · 41 cites

TL;DR

This paper investigates how sensitive large language models are to prompt formatting variations, revealing significant performance fluctuations and proposing a systematic evaluation method called FormatSpread.

Contribution

The study shows that prompt formatting materially affects LLM performance, demonstrates this sensitivity across several models, and introduces FormatSpread for more robust evaluation.

Findings

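01. Accuracy varies by up to 76 points depending on prompt formatting (observed with LLaMA-2-13B in few-shot settings).

02. Sensitivity persists when model size or the number of few-shot examples increases, and after instruction tuning.

03. Per-format performance is only weakly correlated between models.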
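To make the third finding concrete, here is a minimal sketch of how one might check whether two models agree on which formats work well. The accuracy values are hypothetical, and the use of Pearson correlation is an assumption for illustration; the summary above does not say which correlation measure the authors used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical accuracies of two models on the same four prompt formats,
# listed in the same format order. These numbers are made up.
model_a_accuracy = [0.62, 0.41, 0.55, 0.48]
model_b_accuracy = [0.50, 0.58, 0.47, 0.60]

r = pearson(model_a_accuracy, model_b_accuracy)
print(f"correlation across formats: {r:.2f}")
```

A value near 1 would mean a format that helps one model tends to help the other; a value near 0 means a format tuned on one model tells you little about the other.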

Abstract

As large language models (LLMs) are adopted as a fundamental component of language technologies, it is crucial to accurately characterize their performance. Because choices in prompt design can strongly influence model behavior, this design process is critical in effectively using any modern pre-trained generative language model. In this work, we focus on LLM sensitivity to a quintessential class of meaning-preserving design choices: prompt formatting. We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points when evaluated using LLaMA-2-13B. Sensitivity remains even when increasing model size, the number of few-shot examples, or performing instruction tuning. Our analysis suggests that work evaluating LLMs with prompting-based methods would benefit from…
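The "performance differences" reported in the abstract can be illustrated with a small sketch that measures the gap between the best and worst prompt format for one model on one task. The function name `format_spread`, the format strings, and the accuracy values are all hypothetical; this is not the code from the msclar/formatspread repository, only an illustration of the quantity under the assumption that spread means maximum minus minimum accuracy over the formats evaluated.

```python
def format_spread(accuracy_by_format):
    """Return (spread, best_format, worst_format) for a format -> accuracy mapping.

    Spread is taken here to be the maximum accuracy minus the minimum accuracy.
    """
    if not accuracy_by_format:
        raise ValueError("need at least one format")
    best = max(accuracy_by_format, key=accuracy_by_format.get)
    worst = min(accuracy_by_format, key=accuracy_by_format.get)
    return accuracy_by_format[best] - accuracy_by_format[worst], best, worst


# Hypothetical accuracies for three meaning-preserving formats of one prompt.
example = {
    "Passage: {}\nAnswer: {}": 0.62,
    "passage {}\nanswer {}": 0.41,
    "PASSAGE: {} || ANSWER: {}": 0.55,
}

spread, best, worst = format_spread(example)
print(f"spread: {spread * 100:.0f} accuracy points")
```

Reporting this range alongside a single-format score gives readers a sense of how much of a result may be due to formatting alone.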

Code & Models

Repositories

msclar/formatspread (PyTorch, official)

Videos

Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Ferroelectric and Negative Capacitance Devices

Methods: Sparse Evolutionary Training · Focus