Humor in AI: Massive Scale Crowd-Sourced Preferences and Benchmarks for Cartoon Captioning

Jifan Zhang; Lalit Jain; Yang Guo; Jiayi Chen; Kuan Lok Zhou; Siddharth Suresh; Andrew Wagenmaker; Scott Sievert; Timothy Rogers; Kevin Jamieson; Robert Mankoff; Robert Nowak

arXiv:2406.10522·cs.LG·December 19, 2024

TL;DR

This paper introduces a large-scale dataset of human ratings for cartoon captions, proposes new benchmarks for humor quality assessment, and evaluates current models, revealing their limitations in generating humorous content.

Contribution

The paper contributes a massive crowd-sourced dataset, new evaluation benchmarks, and insights into the performance gaps of current AI models in humor generation.

Findings

01 Current models underperform top human humorists.

02 Fine-tuning methods like RLHF and DPO have limitations for creative tasks.

03 The dataset enables future research in AI humor understanding and generation.

Abstract

We present a novel multimodal preference dataset for creative tasks, consisting of over 250 million human ratings on more than 2.2 million captions, collected through crowdsourcing rating data for The New Yorker's weekly cartoon caption contest over the past eight years. This unique dataset supports the development and evaluation of multimodal large language models and preference-based fine-tuning algorithms for humorous caption generation. We propose novel benchmarks for judging the quality of model-generated captions, utilizing both GPT4 and human judgments to establish ranking-based evaluation strategies. Our experimental results highlight the limitations of current fine-tuning methods, such as RLHF and DPO, when applied to creative tasks. Furthermore, we demonstrate that even state-of-the-art models like GPT4 and Claude currently underperform top human contestants in generating…
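
The abstract names DPO as one of the preference-based fine-tuning methods evaluated. As a rough illustration only, the sketch below computes the standard per-pair DPO loss from summed log-probabilities of a preferred and a dispreferred caption. It is not taken from the authors' repository; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for this example.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss (hypothetical helper, not the paper's code).

    Each argument is the summed log-probability of a caption under the
    policy being trained or under the frozen reference model.
    """
    # Log-ratio margin: how much more the policy favors the chosen caption
    # over the rejected one, relative to the reference model.
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    z = beta * margin
    # loss = -log(sigmoid(z)) = log(1 + exp(-z)), written in a
    # numerically stable form for both signs of z.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Illustrative numbers only: the policy favors the higher-rated caption.
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

When the policy and reference agree on both captions the margin is zero and the loss equals log 2; the loss falls below that value as the policy shifts probability toward the preferred caption.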

Code & Models

Repositories

yguooo/cartoon-caption-generation (PyTorch, official)

Datasets

yguooo/newyorker_caption_ranking (dataset · 212 downloads)

Videos

Humor in AI: Massive Scale Crowd-Sourced Preferences and Benchmarks for Cartoon Captioning (SlidesLive)

Taxonomy

Topics: Language, Metaphor, and Cognition · Subtitles and Audiovisual Media · Humor Studies and Applications

Methods: Direct Preference Optimization