TL;DR
HumorBench is a new benchmark for evaluating large language models' ability to understand and explain humor in cartoons, emphasizing reasoning beyond STEM domains and revealing transferability of reasoning skills.
Contribution
Introduces HumorBench, a novel dataset and evaluation framework for assessing LLMs' humor reasoning and explanation capabilities beyond STEM topics.
Findings
STEM reasoning skills transfer to humor understanding
Models trained only on STEM data perform well on humor tasks
Scaling reasoning tokens at test time has mixed effects
Abstract
We present HumorBench, a benchmark designed to evaluate large language models' (LLMs) ability to reason about and explain sophisticated humor in cartoon captions. As reasoning models increasingly saturate existing benchmarks in mathematics and science, novel and challenging evaluations of model intelligence beyond STEM domains are essential. Reasoning is fundamentally involved in text-based humor comprehension, requiring the identification of connections between concepts in cartoons/captions and external cultural references, wordplay, and other mechanisms. HumorBench includes approximately 300 unique cartoon-caption pairs from the New Yorker Caption Contest and Cartoonstock.com, with expert-annotated evaluation rubrics identifying essential joke elements. LLMs are evaluated on their explanations of the humor and their ability to identify the joke elements. To perform well on…
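The element-based rubric idea in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the `RubricElement` schema, the keyword-matching judge, and the example joke are all hypothetical stand-ins, and a real setup would more likely use a human or LLM judge to decide whether each element is covered.

```python
from dataclasses import dataclass


@dataclass
class RubricElement:
    """One essential joke element from a rubric (hypothetical schema)."""
    description: str
    keywords: tuple  # toy stand-in for a real judge; not from the paper


def element_covered(explanation: str, element: RubricElement) -> bool:
    """Toy judge: the element counts as covered if any keyword appears."""
    text = explanation.lower()
    return any(keyword.lower() in text for keyword in element.keywords)


def score_explanation(explanation: str, rubric: list) -> float:
    """Return the fraction of rubric elements the explanation covers."""
    if not rubric:
        raise ValueError("rubric must contain at least one element")
    hits = sum(element_covered(explanation, element) for element in rubric)
    return hits / len(rubric)


# Invented example rubric and explanation, for illustration only.
rubric = [
    RubricElement("Wordplay on the caption", ("pun", "double meaning")),
    RubricElement("The dog is acting as a lawyer", ("lawyer", "attorney")),
]
explanation = "The caption is a pun: the dog, dressed as a lawyer, objects in court."
print(score_explanation(explanation, rubric))
```

Scoring per element rather than per explanation keeps partial credit visible: an explanation that spots the wordplay but misses the second element would score 0.5 under this sketch instead of simply failing.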
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
- During the annotation of the dataset, elements that could subjectively influence humor understanding were deliberately removed.
- The authors avoid using multiple-choice and ranking formats, as these can limit the model’s reasonable divergence in humor understanding. Additionally, fixed options may inadvertently hint at the actual punchline, making it unclear whether the model is reasoning or simply guessing. This is a reasonable and effective improvement.
- The authors designed a multi-round
- Some steps in the dataset construction process are not described in detail. (See Questions.)
- The absence of images in the dataset may weaken its impact, as the main limitation of current multimodal large models lies in their inability to effectively understand humor from images.
1. HumorBench introduces a new non-STEM reasoning task that isolates objective humour understanding, avoiding confounding subjective funniness and providing a valuable probe of high-level reasoning.
2. Through extensive experiments on frontier LLMs, the study reveals clear transfer from STEM reasoning to humour comprehension and mixed effects of test-time scaling, demonstrating both the benchmark’s sensitivity and the cross-domain generality of reasoning abilities.
1. Besides identifying individual elements, the evaluation should also consider the interactions or causal relations among them. Humour understanding depends on how well the model connects these elements, not just mentions them. Element-based evaluation can serve as a supplement. In fact, using an LLM as a judge to compare the response with the golden label can already effectively reflect joke understanding.
2. This conclusion is reasonable, but the paper does not explain why reasoning models i
- The introduction of a rubric-based evaluation framework for humor understanding is novel and well-motivated.
- The proposed HumorBench benchmark has the potential to offer valuable insights and guide future research in multimodal humor understanding.
- The experimental design and analyses are comprehensive, effectively highlighting the limitations and challenges faced by current MLLMs in this domain.
- The benchmark size is relatively small, containing only about 300 cartoons drawn from a limited range of sources (mainly The New Yorker Caption Contest and Cartoonstock.com). This restricted scope may limit the dataset’s generalizability and future applicability.
- While the authors argue that using humor elements as rubrics enables objective humor understanding, this claim raises concerns. The definition and categorization of these humor elements may be subjective and culturally biased, as di
1. The paper identifies a real gap in LLM evaluation: existing benchmarks overemphasize STEM reasoning, while humor demands cultural, linguistic, and inferential reasoning.
2. The dataset is constructed with rigorous curation, including expert validation and removal of inconsistent annotations, contributing to dataset quality and reliability.
3. It provides systematic evaluation, tests across many models, compares base vs. reasoning-optimized variants, and analyzes correlations with other benc
1. The size of the dataset may be limited. Although carefully curated, about 300 cartoon-caption pairs (499 elements) is relatively small compared to other reasoning benchmarks, potentially limiting generalization.
2. The dataset's diversity needs further justification, especially with respect to cultural background and humor type. The original sources are largely Western (New Yorker, Cartoonstock), so the dataset may not test models across diverse humor traditions.
3. The validation shows
