What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering

Federico Errica; Giuseppe Siracusano; Davide Sanvito; Roberto Bifulco

arXiv:2406.12334 · cs.LG · August 26, 2025 · 5 cites

TL;DR

This paper introduces metrics to quantify how sensitive and consistent large language models are to prompt variations, aiding developers in debugging and improving prompt engineering for more reliable AI outputs.

Contribution

It proposes novel sensitivity and consistency metrics for classification tasks, providing tools to analyze and improve LLM robustness beyond traditional performance measures.

Findings

01 Sensitivity varies significantly with prompt rephrasing.

02 Consistency correlates with model robustness.

03 Metrics help identify failure modes in LLM predictions.

Abstract

Large Language Models (LLMs) have changed the way we design and interact with software systems. Their ability to process and extract information from text has drastically improved productivity in a number of routine tasks. Developers who want to include these models in their software stack, however, face a dreadful challenge: debugging LLMs' inconsistent behavior across minor variations of the prompt. We therefore introduce two metrics for classification tasks, namely sensitivity and consistency, which are complementary to task performance. Sensitivity measures how predictions change across rephrasings of the prompt, and does not require access to ground truth labels. Consistency, in contrast, measures how predictions vary across rephrasings for elements of the same class. We perform an empirical comparison of these metrics on text classification tasks, using them as a guideline for…
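
The abstract describes the two metrics only in words. The block below is a minimal sketch of one way to compute quantities of this kind, not necessarily the paper's exact definitions. It assumes each sample has been classified once per prompt rephrasing, scores sensitivity as the mean entropy of each sample's predictions, and scores consistency for a class as one minus the total variation distance between the prediction distributions of pairs of samples that share that true label. The function names, the choice of entropy and total variation distance, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log


def prediction_distribution(preds, classes):
    """Empirical distribution of one sample's predictions across rephrasings."""
    counts = Counter(preds)
    return [counts[c] / len(preds) for c in classes]


def sensitivity(preds_per_sample, classes):
    """Mean entropy (in nats) of per-sample predictions; needs no labels.

    0.0 means every rephrasing yields the same prediction for every sample.
    """
    total = 0.0
    for preds in preds_per_sample:
        dist = prediction_distribution(preds, classes)
        total += -sum(p * log(p) for p in dist if p > 0)
    return total / len(preds_per_sample)


def consistency(preds_per_sample, labels, target_class, classes):
    """Mean of (1 - total variation distance) over pairs of samples
    whose true label is target_class. 1.0 means identical distributions.
    """
    dists = [
        prediction_distribution(preds, classes)
        for preds, label in zip(preds_per_sample, labels)
        if label == target_class
    ]
    pairs = list(combinations(dists, 2))
    if not pairs:
        raise ValueError("need at least two samples of the target class")
    scores = [
        1.0 - 0.5 * sum(abs(a - b) for a, b in zip(d1, d2))
        for d1, d2 in pairs
    ]
    return sum(scores) / len(scores)


# Toy data (hypothetical): 3 samples, 4 prompt rephrasings each.
classes = ["pos", "neg"]
preds_per_sample = [
    ["pos", "pos", "pos", "pos"],  # stable under rephrasing
    ["pos", "neg", "pos", "neg"],  # flips with the wording
    ["neg", "neg", "neg", "neg"],  # stable under rephrasing
]
labels = ["pos", "pos", "neg"]

print(sensitivity(preds_per_sample, classes))
print(consistency(preds_per_sample, labels, "pos", classes))
```

Note that the sensitivity function never reads the labels, matching the abstract's point that it needs no ground truth, whereas the consistency function uses them only to group samples by class.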

Code & Models

Repositories

nec-research/sensitivity-consistency-LLM (Official)

Videos

What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering · underline

Taxonomy

Topics: Higher Education Learning Practices · Diverse Research and Applications · Legal Education and Practice Innovations