Measuring Distribution Shift in User Prompts and Its Effects on LLM Performance
Parker Seegmiller, Sarah Masud Preum

TL;DR
This paper introduces the LENS framework to measure how natural prompt distribution shifts affect the performance of deployed large language models, revealing significant performance drops across diverse user groups and over time.
Contribution
The paper presents a large-scale, data-centric evaluation method for quantifying prompt distribution shifts and their impact on LLM performance in real-world settings.
Findings
Moderate prompt shifts cause 73% average performance loss in LLMs.
Performance degradation is more severe across different user groups and regions.
Prompt distribution shifts over time are strongly correlated with LLM performance drops.
Abstract
LLMs are increasingly deployed in dynamic, real-world settings, where the distribution of user prompts can shift substantially over time as new tasks, prompts, and users are introduced to a deployed model. Such natural prompt distribution shift poses a major challenge to LLM reliability, particularly for specialized models designed for narrow domains or user populations. Despite attention to out-of-distribution robustness, there is very limited exploration of measuring natural prompt distribution shift in prior work, and its impact on deployed LLMs remains poorly understood. We introduce the LLM Evaluation under Natural prompt Shift (LENS) framework: a data-centric approach for quantifying natural prompt distribution shift and evaluating its effect on the performance of deployed LLMs. We perform a large-scale evaluation using 192 real-world post-deployment prompt shift settings over…
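The excerpt above does not say which shift measure LENS uses, so the following is only an illustrative sketch of what "quantifying prompt distribution shift" can look like in practice. It swaps in a simple stand-in technique: Jensen-Shannon divergence between the whitespace-token distributions of a reference prompt set and a post-deployment prompt set. The function names (token_distribution, js_divergence, prompt_shift) and the tokenization are assumptions made for this example, not part of the paper.

```python
import math
from collections import Counter


def token_distribution(prompts):
    """Return a unigram probability distribution over lowercased whitespace tokens."""
    counts = Counter(tok for p in prompts for tok in p.lower().split())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("prompt set contains no tokens")
    return {tok: c / total for tok, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Ranges from 0 (identical distributions) to 1 (disjoint support).
    """
    vocab = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the joint vocabulary.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl_to_mixture(a):
        # KL(a || m); terms with a[t] == 0 contribute nothing and are skipped.
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def prompt_shift(reference_prompts, deployed_prompts):
    """Hypothetical shift score: JS divergence between the two prompt sets."""
    return js_divergence(
        token_distribution(reference_prompts),
        token_distribution(deployed_prompts),
    )
```

With this stand-in measure, two identical prompt sets score 0 and two sets with no shared tokens score 1, so intermediate values give a rough sense of how far post-deployment prompts have drifted from the reference set. An embedding-based distance would be a natural alternative where lexical overlap is a poor proxy for task similarity.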
