Does Tone Change the Answer? Evaluating Prompt Politeness Effects on Modern LLMs: GPT, Gemini, and LLaMA
Hanyu Cai, Binqi Shen, Lier Jin, Lan Hu, Xiaojing Fan

TL;DR
This study systematically evaluates how prompt tone and politeness affect the accuracy of three modern LLMs (GPT-4o mini, Gemini 2.0 Flash, and Llama 4 Scout) on six MMMLU tasks spanning STEM and Humanities, revealing model- and domain-specific sensitivities.
Contribution
It introduces a framework for assessing tone effects on LLM performance and provides empirical insights into the robustness of these models to prompt politeness variations.
Findings
Neutral and polite prompts generally yield higher accuracy than rude prompts.
Tone effects are statistically significant only in some Humanities tasks for GPT and LLaMA.
Sensitivity to tone varies by model and domain, with Gemini more robust than GPT and LLaMA.
Abstract
Prompt engineering has emerged as a critical factor influencing large language model (LLM) performance, yet the impact of pragmatic elements such as linguistic tone and politeness remains underexplored, particularly across different model families. In this work, we propose a systematic evaluation framework to examine how interaction tone affects model accuracy and apply it to three recently released and widely available LLMs: GPT-4o mini (OpenAI), Gemini 2.0 Flash (Google DeepMind), and Llama 4 Scout (Meta). Using the MMMLU benchmark, we evaluate model performance under Very Polite, Neutral, and Very Rude prompt variants across six tasks spanning STEM and Humanities domains, and analyze pairwise accuracy differences with statistical significance testing. Our results show that tone sensitivity is both model-dependent and domain-specific. Neutral or Very Polite prompts generally yield…
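The pairwise comparison described in the abstract can be illustrated with a minimal sketch. The excerpt does not say which significance test the authors use, so the exact McNemar test below is an assumption, chosen because each question is answered under every tone and the outcomes are therefore paired. The function names (`accuracy`, `mcnemar_exact`, `compare_tones`) and the toy data are hypothetical and are not results from the paper.

```python
from math import comb


def accuracy(correct):
    """Fraction of questions answered correctly (entries are 0 or 1)."""
    return sum(correct) / len(correct)


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for paired per-question correctness.

    Assumption: this test stands in for the unspecified significance test.
    Only discordant pairs (one tone right, the other wrong) carry information.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both tones must be scored on the same questions")
    a_only = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    b_only = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = a_only + b_only
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(a_only, b_only)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def compare_tones(results):
    """Map each tone pair to (accuracy difference, p-value)."""
    tones = list(results)
    out = {}
    for i, first in enumerate(tones):
        for second in tones[i + 1:]:
            diff = accuracy(results[first]) - accuracy(results[second])
            out[(first, second)] = (diff, mcnemar_exact(results[first], results[second]))
    return out


# Toy data only: 1 = correct, 0 = incorrect, same five questions per tone.
toy_results = {
    "very_polite": [1, 1, 1, 1, 0],
    "neutral": [1, 1, 1, 0, 0],
    "very_rude": [0, 0, 0, 0, 0],
}

for pair, (diff, p_value) in compare_tones(toy_results).items():
    print(pair, round(diff, 3), round(p_value, 3))
```

With three tones this produces three pairwise comparisons per model and task. A real analysis across six tasks and three models would likely also need a correction for multiple comparisons, which this sketch omits.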
