HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models
Edward Ajayi, Prasenjit Mitra

TL;DR
HumorRank introduces a tournament-based framework for evaluating and ranking humor generation in large language models, providing a scalable and interpretable benchmarking method grounded in the General Theory of Verbal Humor.
Contribution
It presents a novel evaluation framework using pairwise comparisons and tournament aggregation to produce consistent humor quality rankings for LLMs.
Findings
HumorRank produces statistically grounded model stratifications.
Humor quality correlates with mastery of comedic mechanisms, not just model size.
The framework is scalable and interpretable for benchmarking humor in LLMs.
Abstract
Evaluating humor in large language models (LLMs) is an open challenge because existing approaches yield isolated, incomparable metrics rather than unified model rankings, making it difficult to track progress across systems. We introduce HumorRank, a tournament-based evaluation framework and leaderboard for textual humor generation. Using the SemEval-2026 MWAHAHA test dataset, we conduct an extensive automated pairwise evaluation across nine models spanning proprietary, open-weight, and specialized systems. Pairwise judgments grounded in the General Theory of Verbal Humor (GTVH) are aggregated via an Adaptive Swiss tournament, with Bradley-Terry Maximum Likelihood Estimation (MLE) producing globally consistent humor generation capability rankings. Our results demonstrate that HumorRank yields statistically grounded model stratifications, showing that humor quality is driven by mastery of…
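To make the aggregation step concrete, here is a minimal sketch of how pairwise win counts can be turned into Bradley-Terry strengths. It is not the authors' implementation: the model names ("A", "B", "C") and win counts are invented for illustration, the function name `bradley_terry` is hypothetical, and the fit uses the standard minorization-maximization (MM) update, which is one common way to compute the MLE. The paper may use a different solver, and the Adaptive Swiss pairing step is not shown.

```python
from collections import defaultdict


def bradley_terry(wins, n_iter=500, tol=1e-10):
    """Fit Bradley-Terry strengths with the MM update.

    wins[(a, b)] is the number of comparisons in which a beat b.
    Returns a dict of strengths normalized to sum to 1.
    """
    models = sorted({m for pair in wins for m in pair})
    total_wins = defaultdict(float)   # W_i: total wins of model i
    games = defaultdict(float)        # n_ij: comparisons between i and j
    for (a, b), w in wins.items():
        total_wins[a] += w
        games[(a, b)] += w
        games[(b, a)] += w

    p = {m: 1.0 / len(models) for m in models}
    for _ in range(n_iter):
        new_p = {}
        for i in models:
            # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                games.get((i, j), 0.0) / (p[i] + p[j])
                for j in models
                if j != i
            )
            new_p[i] = total_wins[i] / denom if denom > 0 else p[i]
        norm = sum(new_p.values())
        new_p = {m: v / norm for m, v in new_p.items()}
        delta = max(abs(new_p[m] - p[m]) for m in models)
        p = new_p
        if delta < tol:
            break
    return p


# Hypothetical judgments: each pair of models is compared 10 times.
example_wins = {
    ("A", "B"): 8, ("B", "A"): 2,
    ("B", "C"): 7, ("C", "B"): 3,
    ("A", "C"): 9, ("C", "A"): 1,
}
strengths = bradley_terry(example_wins)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

Under the Bradley-Terry model, the probability that model i beats model j is p_i / (p_i + p_j), so the fitted strengths give a single global ordering even when not every pair of models has been compared equally often. Note that the MLE is only well defined when every model has at least one win and one loss in a connected comparison graph.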
