When Meaning Stays the Same, but Models Drift: Evaluating Quality of Service under Token-Level Behavioral Instability in LLMs
Xiao Li, Joel Kreuzwieser, Alan Peters

TL;DR
This paper introduces PBSS, a framework for measuring how large language models' responses change when a prompt is reworded at the token level without changing its meaning, revealing model-specific behavioral drift.
Contribution
The study presents a new diagnostic method for evaluating LLM stability under prompt rephrasing and points to tokenization and decoding as likely contributors to inconsistent responses.
Findings
Model-specific response shifts under prompt variance
Statistical regularities linked to tokenization and decoding
Behavioral drift persists despite semantic equivalence
Abstract
We investigate how large language models respond to prompts that differ only in their token-level realization but preserve the same semantic intent, a phenomenon we call prompt variance. We propose Prompt-Based Semantic Shift (PBSS), a diagnostic framework for measuring behavioral drift in LLMs under semantically equivalent prompt rewordings. Applied to ten constrained tasks, PBSS reveals consistent, model-specific response shifts, suggesting statistical regularities linked to tokenization and decoding. These results highlight an overlooked dimension of model evaluation, namely stability under rephrasing, and suggest that tokenization strategies and decoding dynamics may contribute to post-training quality-of-service instability.
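As a rough illustration of the idea in the abstract, the sketch below scores how much a set of responses to semantically equivalent prompts differ from one another. It is not the paper's PBSS metric, which this page does not specify: the word-set Jaccard distance is a stand-in chosen only to keep the example dependency-free, and the function names `jaccard_distance`, `drift_matrix`, and `mean_drift` are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """Stand-in drift measure (an assumption, not PBSS itself):
    1 - |A & B| / |A | B| over lowercase whitespace-split word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0  # two empty responses are treated as identical
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def drift_matrix(responses: list[str]) -> list[list[float]]:
    """Symmetric matrix of pairwise distances between responses,
    one response per paraphrase of the same prompt."""
    n = len(responses)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = jaccard_distance(responses[i], responses[j])
        matrix[i][j] = matrix[j][i] = d
    return matrix


def mean_drift(responses: list[str]) -> float:
    """Average pairwise distance; 0.0 means every response used the same words."""
    n = len(responses)
    if n < 2:
        return 0.0
    matrix = drift_matrix(responses)
    total = sum(matrix[i][j] for i, j in combinations(range(n), 2))
    return total / (n * (n - 1) / 2)
```

In practice one would collect `responses` by sending each paraphrase of a prompt to the same model under fixed decoding settings, and would likely replace the Jaccard stand-in with an embedding-based distance; comparing `mean_drift` across models would then give a crude per-model stability score.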
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
