CEI: A Benchmark for Evaluating Pragmatic Reasoning in Language Models

Jon Chun; Hannah Sussman; Adrian Mangine; Murathan Kocaman; Kirill Sidorko; Abhigya Koirala; Andre McCloud; Gwen Eisenbeis; Wisdom Akanwe; Moustapha Gassama; Eliezer Gonzalez Chirinos; Anne-Duncan Enright; Peter Dunson; Tiffanie Ng; Anna von Rosenstiel; Godwin Idowu

arXiv:2603.09993 · cs.CL · March 12, 2026

Open Access

TL;DR

The paper introduces the CEI Benchmark, a dataset of 300 scenarios designed to evaluate large language models' ability to perform pragmatic reasoning in diverse, real-world communication contexts involving ambiguity and social dynamics.

Contribution

It provides a new, comprehensive benchmark with detailed annotation methodology for assessing pragmatic inference in language models.

Findings

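01. LLMs struggle with pragmatic inference, especially in ambiguous scenarios.

02. Inter-annotator agreement is low (Fleiss' kappa = 0.06-0.25 by subtype), highlighting the inherent difficulty of pragmatic reasoning.

03. The dataset covers five pragmatic subtypes (sarcasm/irony, mixed signals, strategic politeness, passive aggression, deflection/misdirection) across workplace, family, social, and service settings.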

Abstract

Pragmatic reasoning, inferring intended meaning beyond literal semantics, underpins everyday communication yet remains difficult for large language models. We present the Contextual Emotional Inference (CEI) Benchmark: 300 human-validated scenarios for evaluating how well LLMs disambiguate pragmatically complex utterances. Each scenario pairs a situational context and speaker-listener roles (with explicit power relations) against an ambiguous utterance. The dataset covers five pragmatic subtypes (sarcasm/irony, mixed signals, strategic politeness, passive aggression, deflection/misdirection) drawn from workplace, family, social, and service settings, with three power configurations (peer, higher-to-lower, lower-to-higher). Three trained annotators independently labeled every scenario. Inter-annotator agreement (Fleiss' kappa = 0.06-0.25 by subtype) is low but expected: pragmatic…
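The abstract reports agreement as Fleiss' kappa over three annotators. As a rough illustration of how that statistic is computed, here is a minimal sketch in Python. It is not the authors' evaluation code: the function name `fleiss_kappa`, the input layout, and the placeholder labels "A", "B", "C" are assumptions made for this example, since the listing does not show the label set the annotators used.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a fixed number of raters per item.

    ratings: list of items, each a list of labels (one per rater).
    categories: list of all possible labels.
    Both the layout and the labels are hypothetical, chosen for illustration.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # counts[i][j] = number of raters who gave item i the label categories[j]
    counts = []
    for item in ratings:
        tally = Counter(item)
        counts.append([tally[c] for c in categories])

    # Observed agreement: mean over items of the share of agreeing rater pairs
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall label proportions
    total = n_items * n_raters
    proportions = [
        sum(row[j] for row in counts) / total for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one label is ever used")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Three hypothetical annotators labelling four hypothetical scenarios
example = [
    ["A", "A", "B"],
    ["B", "B", "B"],
    ["A", "C", "B"],
    ["C", "C", "A"],
]
kappa = fleiss_kappa(example, ["A", "B", "C"])
```

A value near 0 means the annotators agree about as often as chance would predict, and a value of 1 means perfect agreement, which is the scale on which the reported 0.06-0.25 range should be read.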

Taxonomy

Topics: Sentiment Analysis and Opinion Mining · Topic Modeling · Mental Health via Writing