Learning to Think Like a Cartoon Captionist: Incongruity-Resolution Supervision for Multimodal Humor Understanding

Hatice Merve Vural; Doga Kukul; Ege Erdem Ozlu; Demir Ekin Arikan; Bob Mankoff; Erkut Erdem; Aykut Erdem

arXiv:2604.15210 · cs.AI · April 17, 2026

TL;DR

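This paper introduces IRS (Incongruity-Resolution Supervision), a framework that decomposes humor understanding into three structured reasoning components: incongruity modeling, resolution modeling, and preference alignment. It improves multimodal humor comprehension and outperforms strong baselines on caption matching and ranking tasks.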

Contribution

IRS is a novel supervision method that explicitly models the reasoning process behind humor understanding, grounded in cognitive theory and expert practice.

Findings

01 IRS outperforms strong baselines on caption matching and ranking tasks.

02 The largest model approaches expert-level performance on ranking.

03 Zero-shot transfer shows that IRS learns generalizable reasoning patterns.

Abstract

Humor is one of the few cognitive tasks where getting the reasoning right matters as much as getting the answer right. While recent work evaluates humor understanding on benchmarks such as the New Yorker Cartoon Caption Contest (NYCC), it largely treats the task as black-box prediction, overlooking the structured reasoning processes that underlie humor comprehension. We introduce IRS (Incongruity-Resolution Supervision), a framework that decomposes humor understanding into three components: incongruity modeling, which identifies mismatches in the visual scene; resolution modeling, which constructs coherent reinterpretations of these mismatches; and preference alignment, which evaluates candidate interpretations under human judgments. Grounded in incongruity-resolution theory and expert captionist practice, IRS supervises the intermediate reasoning process through structured traces that make the path…
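To make the three-part decomposition concrete, the following is a minimal sketch of how one reasoning trace could be held in memory. It is an illustration only: the class names `Resolution` and `IRSTrace`, every field name, the `to_text` layout, and the example cartoon are assumptions made here, not the paper's actual trace format or data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resolution:
    """One candidate caption plus the reinterpretation that makes it coherent.

    Hypothetical schema, not taken from the paper.
    """
    caption: str
    reinterpretation: str


@dataclass
class IRSTrace:
    """Illustrative container for the three stages named in the abstract."""
    # Stage 1 (incongruity modeling): mismatches noticed in the visual scene.
    incongruities: List[str]
    # Stage 2 (resolution modeling): reinterpretations, one per candidate caption.
    resolutions: List[Resolution] = field(default_factory=list)
    # Stage 3 (preference alignment): index of the human-preferred candidate.
    preferred_index: int = -1  # -1 means no preference recorded (assumption)

    def preferred_caption(self) -> str:
        """Return the caption that human judgments favor."""
        if not 0 <= self.preferred_index < len(self.resolutions):
            raise ValueError("no preferred caption recorded")
        return self.resolutions[self.preferred_index].caption

    def to_text(self) -> str:
        """Serialize the trace as staged text, one stage after another."""
        lines = ["Incongruity:"]
        lines += [f"- {item}" for item in self.incongruities]
        lines.append("Resolution:")
        lines += [f"- {r.caption} => {r.reinterpretation}" for r in self.resolutions]
        lines.append(f"Preference: {self.preferred_caption()}")
        return "\n".join(lines)


# Invented example, for illustration only.
trace = IRSTrace(
    incongruities=["A fish is sitting at an office desk"],
    resolutions=[
        Resolution("I came for the benefits.", "The fish is treated as a job seeker."),
        Resolution("Is it wet in here?", "The fish expects water where there is none."),
    ],
    preferred_index=0,
)
print(trace.to_text())
```

Keeping the stages as separate fields means each one can be inspected or supervised on its own, which is the property the abstract attributes to structured traces; how the authors actually encode and train on them is not specified in this excerpt.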
