Investigating Counterfactual Unfairness in LLMs towards Identities through Humor
Shubin Kim, Yejin Son, Junyeong Park, Keummin Ka, Seungbeen Lee, Jaeyoung Lee, Hyeju Jang, Alice Oh, Youngjae Yu

TL;DR
This paper examines how language models' social biases surface in humor tasks: responses differ depending on the identity of the speaker, which complicates efforts toward fairness and cultural alignment.
Contribution
It introduces a framework and interpretable bias metrics for measuring counterfactual unfairness in language models on humor-related tasks, exposing disparities tied to the social identities of the speaker and the addressee.
Findings
Models refuse jokes told by privileged speakers up to 67.5% more often.
Privileged speakers' jokes are judged malicious 64.7% more frequently.
Social harm ratings are up to 1.5 points higher for privileged speakers.
Abstract
Humor holds up a mirror to social perception: what we find funny often reflects who we are and how we judge others. When language models engage with humor, their reactions expose the social assumptions they have internalized from training data. In this paper, we investigate counterfactual unfairness through humor by observing how the model's responses change when we swap who speaks and who is addressed while holding other factors constant. Our framework spans three tasks: humor generation refusal, speaker intention inference, and relational/societal impact prediction, covering both identity-agnostic humor and identity-specific disparagement humor. We introduce interpretable bias metrics that capture asymmetric patterns under identity swaps. Experiments across state-of-the-art models reveal consistent relational disparities: jokes told by privileged speakers are refused up to 67.5% more…
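
The abstract describes metrics that capture asymmetric behavior under identity swaps but, as excerpted here, does not define them. As a rough illustration only, the sketch below computes one plausible quantity for the refusal task: the relative change in refusal rate when speaker and listener identities are exchanged. The `Trial` record, the `swap_gap` function, and the `group_a`/`group_b` labels are hypothetical names chosen for this example, and treating the gap as a relative difference is an assumption, not the paper's stated definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One humor-generation request (hypothetical record layout)."""
    speaker: str    # identity label of the joke teller
    listener: str   # identity label of the person addressed
    refused: bool   # True if the model declined to produce the joke


def refusal_rate(trials, speaker, listener):
    """Fraction of trials with this speaker/listener pair that were refused."""
    matching = [t for t in trials if t.speaker == speaker and t.listener == listener]
    if not matching:
        raise ValueError(f"no trials for speaker={speaker!r}, listener={listener!r}")
    return sum(t.refused for t in matching) / len(matching)


def swap_gap(trials, a, b):
    """Relative increase in refusal rate for (a speaks to b) over (b speaks to a).

    Assumed definition for illustration: (forward - backward) / backward.
    A value of 0.5 means the forward direction is refused 50% more often.
    """
    forward = refusal_rate(trials, a, b)
    backward = refusal_rate(trials, b, a)
    if backward == 0:
        # Relative change is undefined against a zero baseline.
        return float("inf") if forward > 0 else 0.0
    return (forward - backward) / backward


# Toy data, invented for the example: 3 of 4 refusals in one direction,
# 2 of 4 in the swapped direction.
example_trials = (
    [Trial("group_a", "group_b", r) for r in (True, True, True, False)]
    + [Trial("group_b", "group_a", r) for r in (True, True, False, False)]
)

if __name__ == "__main__":
    print(f"swap gap: {swap_gap(example_trials, 'group_a', 'group_b'):.2f}")
```

Holding the prompt fixed and changing only the identity labels is what makes the comparison counterfactual: any nonzero gap is attributable to the swap rather than to the joke content.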
