Engagement Undermines Safety: How Stereotypes and Toxicity Shape Humor in Language Models

Atharvan Dogra; Soumya Suvra Ghosal; Ameet Deshpande; Ashwin Kalyan; Dinesh Manocha

arXiv:2510.18454·cs.CL·October 22, 2025

TL;DR

This study investigates how humor generation in large language models can inadvertently reinforce stereotypes and toxicity, revealing biases and structural issues in humor and safety alignment.

Contribution

The paper introduces an evaluation framework that jointly measures humor, stereotypicality, and toxicity in LLMs, highlighting bias amplification and the structural embedding of harmful content.

Findings

01. Harmful outputs receive higher humor scores, especially with role-based prompts.

02. Harmful cues increase predictive uncertainty and can make harmful punchlines more expected.

03. Satire generation in LLMs increases stereotypicality and toxicity, affecting human perceptions.

Abstract

Large language models are increasingly used for creative writing and engagement content, raising safety concerns about their outputs. Therefore, casting humor generation as a testbed, this work evaluates how funniness optimization in modern LLM pipelines couples with harmful content by jointly measuring humor, stereotypicality, and toxicity. This is further supplemented by analyzing incongruity signals through information-theoretic metrics. Across six models, we observe that harmful outputs receive higher humor scores, which further increase under role-based prompting, indicating a bias amplification loop between generators and evaluators. Information-theoretic analyses show that harmful cues widen predictive uncertainty and, surprisingly, can even make harmful punchlines more expected for some models, suggesting structural embedding in learned humor distributions. External validation on an…
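To make the information-theoretic vocabulary concrete, here is a minimal sketch of two standard quantities that such analyses usually rely on: Shannon entropy (predictive uncertainty over possible continuations) and surprisal (how unexpected one particular continuation is). The excerpt does not state which metrics the paper actually computes, so treating them as entropy and surprisal is an assumption, and the two probability tables below are invented toy numbers, not results from the paper.

```python
import math


def surprisal(prob):
    """Surprisal in bits of an outcome with probability `prob` (lower = more expected)."""
    return -math.log2(prob)


def entropy(dist):
    """Shannon entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Hypothetical distributions over punchline types (toy values, assumed for illustration).
# `neutral`: setup without a harmful cue; `cued`: same setup with a harmful cue added.
neutral = {"benign": 0.7, "harmful": 0.1, "other": 0.2}
cued = {"benign": 0.4, "harmful": 0.4, "other": 0.2}

# Wider predictive uncertainty shows up as higher entropy under the cue.
print("entropy (neutral):", round(entropy(neutral), 3))
print("entropy (cued):   ", round(entropy(cued), 3))

# A "more expected" harmful punchline shows up as lower surprisal under the cue.
print("surprisal of harmful (neutral):", round(surprisal(neutral["harmful"]), 3))
print("surprisal of harmful (cued):   ", round(surprisal(cued["harmful"]), 3))
```

In this toy setting both effects described in the abstract can appear together: the cue spreads probability mass more evenly (entropy rises) while also shifting mass toward the harmful punchline (its surprisal falls). In practice the probabilities would come from a language model's token-level outputs rather than a hand-written table.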

Taxonomy

Topics: Humor Studies and Applications · Language, Metaphor, and Cognition · Psychology of Moral and Emotional Judgment