PUB: A Pragmatics Understanding Benchmark for Assessing LLMs' Pragmatics Capabilities

Settaluri Lakshmi Sravanthi; Meet Doshi; Tankala Pavan Kalyan; Rudra Murthy; Pushpak Bhattacharyya; Raj Dabre

arXiv:2401.07078 · cs.CL · January 17, 2024 · 1 citation

TL;DR

This paper introduces PUB, a benchmark dataset for evaluating large language models' pragmatic understanding across various phenomena, revealing current models' limitations compared to human performance.

Contribution

The paper presents a new benchmark dataset with 14 tasks across four pragmatics phenomena, enabling systematic evaluation of LLMs' pragmatic reasoning abilities.

Findings

01 Fine-tuning for instruction-following and chat improves the pragmatic capabilities of smaller models.

02 Larger base models perform comparably to their chat-adapted versions.

03 Model performance varies and falls short of human performance.

Abstract

LLMs have demonstrated remarkable capability for understanding semantics, but they often struggle with understanding pragmatics. To demonstrate this fact, we release a Pragmatics Understanding Benchmark (PUB) dataset consisting of fourteen tasks in four pragmatics phenomena, namely, Implicature, Presupposition, Reference, and Deixis. We curated high-quality test sets for each task, consisting of Multiple Choice Question Answers (MCQA). PUB includes a total of 28k data points, 6.1k of which have been created by us, and the rest are adapted from existing datasets. We evaluated nine models varying in the number of parameters and type of training. Our study indicates that fine-tuning for instruction-following and chat significantly enhances the pragmatics capabilities of smaller language models. However, for larger models, the base versions perform comparably with their chat-adapted…
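
Because every PUB test set is in multiple-choice (MCQA) form, a model can be scored by per-task accuracy. The following is a minimal sketch of that kind of scoring, not the authors' evaluation code: the field names (task, question, options, answer), the toy examples, and the always_first baseline are all hypothetical and are not taken from the PUB release.

```python
from collections import defaultdict


def accuracy_by_task(examples, predict):
    """Return {task: accuracy} for a list of multiple-choice examples.

    Each example is a dict with hypothetical keys:
      task     - name of the pragmatics task
      question - prompt text
      options  - list of candidate answers
      answer   - index of the correct option
    `predict` is any callable mapping (question, options) to an option index.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        choice = predict(ex["question"], ex["options"])
        total[ex["task"]] += 1
        if choice == ex["answer"]:
            correct[ex["task"]] += 1
    return {task: correct[task] / total[task] for task in total}


# Invented toy data, for illustration only (not drawn from PUB).
toy_examples = [
    {"task": "implicature", "question": "A: Coming tonight? B: I have an exam.",
     "options": ["B declines", "B accepts"], "answer": 0},
    {"task": "implicature", "question": "A: Is the soup good? B: The bread is nice.",
     "options": ["The soup is good", "The soup is not good"], "answer": 1},
    {"task": "deixis", "question": "'I will be here tomorrow.' What does 'here' refer to?",
     "options": ["The speaker's location", "The listener's location"], "answer": 0},
]


def always_first(question, options):
    """Trivial baseline that always picks the first option."""
    return 0


if __name__ == "__main__":
    print(accuracy_by_task(toy_examples, always_first))
```

Swapping always_first for a function that queries a model and maps its reply to an option index would give per-task accuracies that can be set against a human baseline.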

Code & Models

Datasets

Huangtubaye233/AltPrag
dataset · 10 downloads

Videos

PUB: A Pragmatics Understanding Benchmark for Assessing LLMs' Pragmatics Capabilities

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: Balanced Selection