Evaluating Pragmatic Reasoning in Large Language Models: Evidence from Scalar Diversity
Ye-eun Cho

TL;DR
This study investigates how different evaluation methods affect the assessment of pragmatic reasoning in large language models, revealing that results depend heavily on task design, prompting strategy, and model family.
Contribution
The study compares direct probability measurement with metalinguistic prompting and shows that measured pragmatic reasoning varies across models and conditions and depends on the evaluation strategy.
Findings
Neither evaluation method consistently outperforms the other.
Pragmatic behavior varies across model families, prompting strategies, and task structures.
Scalar diversity gradients appear only in specific model-condition combinations.
Abstract
Evaluating pragmatic reasoning in large language models (LLMs) remains challenging because model behavior can vary depending on evaluation methods. Previous studies suggest that prompt-based judgments may diverge from models' internal probability distributions, raising questions about whether observed performance reflects underlying competence or task-induced behavior. This study examines this issue using scalar diversity as a graded diagnostic for pragmatic inference. Following Hu & Levy (2023), this study compares direct probability measurement and metalinguistic prompting across multiple models and experimental settings. The results show that neither evaluation method consistently outperforms the other and that pragmatic behavior varies substantially across model families, prompting strategies, and task structures. Moreover, scalar diversity gradients emerge only in specific…
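The contrast between the two evaluation methods can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the two-option setup, and the 0.5 threshold are assumptions made for illustration, and the log-probabilities in the usage section are placeholder inputs rather than measurements from any model. The sketch assumes that direct measurement compares the log-probabilities a model assigns to competing continuations (for example, a "not all" reading versus an "all" reading after "some"), while metalinguistic prompting compares the log-probabilities of answer tokens such as "Yes" and "No" to an explicit question about the inference.

```python
import math


def choice_probability(logprobs, target):
    """Softmax-renormalize log-probabilities over a closed option set
    and return the share assigned to `target`."""
    max_lp = max(logprobs.values())  # subtract the max for numerical stability
    weights = {k: math.exp(v - max_lp) for k, v in logprobs.items()}
    return weights[target] / sum(weights.values())


def compare_methods(direct_logprobs, direct_target,
                    meta_logprobs, meta_target, threshold=0.5):
    """Return (direct_p, meta_p, agree), where `agree` says whether both
    methods fall on the same side of `threshold` (a hypothetical cutoff)."""
    direct_p = choice_probability(direct_logprobs, direct_target)
    meta_p = choice_probability(meta_logprobs, meta_target)
    agree = (direct_p > threshold) == (meta_p > threshold)
    return direct_p, meta_p, agree


if __name__ == "__main__":
    # Placeholder inputs, NOT real model outputs.
    # Direct: log-probabilities of two candidate continuations.
    direct = {"not_all": math.log(0.75), "all": math.log(0.25)}
    # Metalinguistic: log-probabilities of the answer tokens to a
    # question such as "Does 'some' here mean 'not all'?"
    meta = {"Yes": math.log(0.4), "No": math.log(0.6)}

    direct_p, meta_p, agree = compare_methods(direct, "not_all", meta, "Yes")
    print(f"direct={direct_p:.2f} meta={meta_p:.2f} agree={agree}")
```

Under these assumptions, an item where the two methods disagree is one where the prompted judgment diverges from the probability assigned to the strings themselves. Repeating the comparison across scalar items of differing strength is one way a gradient could be checked under each method.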
