TL;DR
This paper evaluates large language models' ability to understand and pragmatically use culturally grounded figurative language in Arabic and English, revealing significant gaps in cultural reasoning and contextual use.
Contribution
It introduces a comprehensive evaluation framework and dataset for assessing LLMs' understanding and pragmatic use of figurative language across Arabic and English.
Findings
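LLMs perform worse on Arabic proverbs than on English ones (average accuracy 4.29% lower), and worse still on Egyptian Arabic idioms (10.28% lower than on Arabic proverbs).
Accuracy on the pragmatic use task drops by 14.07% relative to the understanding task.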
Models struggle with connotative meanings, aligning less with human judgments.
Abstract
We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural nuance. Using figurative language as a proxy for cultural nuance and local knowledge, we design evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation in Arabic and English. We evaluate 22 open- and closed-source LLMs on Egyptian Arabic idioms, multidialectal Arabic proverbs, and English proverbs. Our results show a consistent hierarchy: the average accuracy for Arabic proverbs is 4.29% lower than for English proverbs, and performance for Egyptian idioms is 10.28% lower than for Arabic proverbs. For the pragmatic use task, accuracy drops by 14.07% relative to understanding, though providing contextual idiomatic…