Text Is Not All You Need: Multimodal Prompting Helps LLMs Understand Humor
Ashwin Baluja

TL;DR
This paper demonstrates that multimodal prompting, incorporating both text and spoken cues, enhances large language models' ability to understand and explain humor, which is inherently multimodal in nature.
Contribution
The study introduces a simple multimodal prompting method using speech cues to improve humor understanding in LLMs, surpassing text-only approaches.
Findings
Multimodal prompts improve humor explanation accuracy.
Speech cues enhance LLM performance across datasets.
Multimodal approach outperforms text-only methods.
Abstract
While Large Language Models (LLMs) have demonstrated impressive natural language understanding capabilities across various text-based tasks, understanding humor has remained a persistent challenge. Humor is frequently multimodal, relying on phonetic ambiguity, rhythm and timing to convey meaning. In this study, we explore a simple multimodal prompting approach to humor understanding and explanation. We present an LLM with both the text and the spoken form of a joke, generated using an off-the-shelf text-to-speech (TTS) system. Using multimodal cues improves the explanations of humor compared to textual prompts across all tested datasets.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsLanguage, Metaphor, and Cognition · Humor Studies and Applications · American Literature and Humor Studies
