CleanComedy: Creating Friendly Humor through Generative Techniques
Dmitry Vikhorev, Daria Galimzianova, Svetlana Gorovaia, Elizaveta Zhemchuzhina, Ivan P. Yamshchikov

TL;DR
This paper introduces CleanComedy, a new toxicity-filtered humor dataset in English and Russian, and evaluates its effectiveness in improving humor generation models while addressing toxicity issues.
Contribution
The paper presents a novel, partially annotated, toxicity-filtered humor corpus and assesses its impact on generating safer, higher-quality jokes using generative models.
Findings
The CleanComedy dataset reduces toxicity in generated jokes.
Models trained on CleanComedy produce more humorous and less toxic jokes.
A survey confirms that data filtering is effective at improving humor quality.
Abstract
Humor generation is a challenging task in natural language processing due to limited resources and the quality of existing datasets. Available humor language resources often suffer from toxicity and duplication, limiting their effectiveness for training robust models. This paper proposes CleanComedy, a specialized, partially annotated toxicity-filtered corpus of English and Russian jokes collected from various sources. We study the effectiveness of our data filtering approach through a survey on humor and toxicity levels in various joke groups. In addition, we study advances in computer humor generation by comparing jokes written by humans with various groups of generative jokes, including our baseline models trained on the CleanComedy datasets.
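The abstract names duplication and toxicity as the two problems the corpus filtering targets. As a rough illustration only, the sketch below shows one simple way such a cleaning pass could look. It is not the authors' pipeline: the `BLOCKLIST` words, the `normalize` rule, and the function names are all hypothetical, and a real system would more plausibly use a trained toxicity classifier than a word list.

```python
import re

# Hypothetical placeholder blocklist (an assumption for this sketch, not from the paper).
BLOCKLIST = {"idiot", "stupid"}


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace so trivially
    different copies of the same joke map to the same key."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def is_toxic(text: str) -> bool:
    """Flag a joke if any normalized token appears in the blocklist."""
    return any(token in BLOCKLIST for token in normalize(text).split())


def clean_corpus(jokes):
    """Return jokes in their original order, skipping empty entries,
    duplicates (after normalization), and blocklisted entries."""
    seen = set()
    kept = []
    for joke in jokes:
        key = normalize(joke)
        if not key or key in seen or is_toxic(joke):
            continue
        seen.add(key)
        kept.append(joke)
    return kept


if __name__ == "__main__":
    sample = [
        "Why did the chicken cross the road?",
        "why did the chicken cross the road",
        "You are an idiot!",
        "  ",
    ]
    print(clean_corpus(sample))
```

Because `\w` matches Unicode word characters in Python 3, the same normalization applies to both English and Russian text, although exact-match deduplication like this will miss paraphrased duplicates.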
Taxonomy
Topics: Humor Studies and Applications · Comics and Graphic Narratives
