Humor in AI: Massive Scale Crowd-Sourced Preferences and Benchmarks for Cartoon Captioning
Jifan Zhang, Lalit Jain, Yang Guo, Jiayi Chen, Kuan Lok Zhou, Siddharth Suresh, Andrew Wagenmaker, Scott Sievert, Timothy Rogers, Kevin Jamieson, Robert Mankoff, Robert Nowak

TL;DR
This paper introduces a large-scale dataset of human ratings for cartoon captions, proposes new benchmarks for humor quality assessment, and evaluates current models, revealing their limitations in generating humorous content.
Contribution
The paper contributes a crowd-sourced dataset of over 250 million human ratings on more than 2.2 million captions, new ranking-based evaluation benchmarks, and an analysis of where current AI models fall short in humor generation.
Findings
Even state-of-the-art models such as GPT4 and Claude underperform top human contestants at writing humorous captions.
Current fine-tuning methods such as RLHF and DPO show limitations when applied to creative tasks.
The dataset enables future research in AI humor understanding and generation.
Abstract
We present a novel multimodal preference dataset for creative tasks, consisting of over 250 million human ratings on more than 2.2 million captions, collected through crowdsourcing rating data for The New Yorker's weekly cartoon caption contest over the past eight years. This unique dataset supports the development and evaluation of multimodal large language models and preference-based fine-tuning algorithms for humorous caption generation. We propose novel benchmarks for judging the quality of model-generated captions, utilizing both GPT4 and human judgments to establish ranking-based evaluation strategies. Our experimental results highlight the limitations of current fine-tuning methods, such as RLHF and DPO, when applied to creative tasks. Furthermore, we demonstrate that even state-of-the-art models like GPT4 and Claude currently underperform top human contestants in generating…
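For readers unfamiliar with DPO, the following is a minimal sketch of the standard Direct Preference Optimization loss for a single preference pair (for example, a higher-rated and a lower-rated caption for the same cartoon). It is not taken from the paper: the function name, argument names, and the value of `beta` are illustrative assumptions, and the log-probabilities are made-up numbers rather than model outputs.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All arguments are sequence log-probabilities. `beta` scales how
    strongly the policy is pushed away from the reference model; 0.1 is
    an illustrative default, not a value reported in the paper.
    """
    # Implicit reward of each response: log-ratio of policy to reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_logratio - rejected_logratio)
    # loss = -log(sigmoid(z)), computed in a numerically stable way.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Hypothetical log-probabilities for two captions of the same cartoon.
baseline = dpo_loss(-5.0, -6.0, -5.0, -6.0)   # policy identical to reference
improved = dpo_loss(-4.0, -7.0, -5.0, -6.0)   # policy favors the chosen caption
print(baseline, improved)
```

When the policy matches the reference model the loss equals log 2, and it decreases as the policy assigns relatively more probability to the preferred caption than the reference does.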
Taxonomy
Topics: Language, Metaphor, and Cognition · Subtitles and Audiovisual Media · Humor Studies and Applications
Methods: Direct Preference Optimization
