What's Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content
Alfio Ferrara, Sergio Picascia, Laura Pinnavaia, Vojimir Ranitovic, Elisabetta Rocchetti, Alice Tuveri

TL;DR
This paper empirically investigates how GPT-4o-mini implicitly moderates sensitive content when paraphrasing it, finding that the model reduces derogatory and taboo language without being instructed to. It also evaluates how well LLMs classify sentence sensitivity in a zero-shot setting.
Contribution
It provides a systematic empirical analysis of implicit content moderation by GPT-4o-mini during paraphrasing, together with an evaluation of zero-shot sensitivity classification by LLMs.
Findings
GPT-4o-mini reduces derogatory and taboo language in paraphrasing.
Paraphrased sentences shift systematically toward less sensitive classes.
Zero-shot classification performance is comparable to traditional methods.
Abstract
Proprietary Large Language Models (LLMs) have shown tendencies toward politeness, formality, and implicit content moderation. While previous research has primarily focused on explicitly training models to moderate and detoxify sensitive content, there has been limited exploration of whether LLMs implicitly sanitize language without explicit instructions. This study empirically analyzes the implicit moderation behavior of GPT-4o-mini when paraphrasing sensitive content and evaluates the extent of sensitivity shifts. Our experiments indicate that GPT-4o-mini systematically moderates content toward less sensitive classes, with substantial reductions in derogatory and taboo language. We also evaluate the zero-shot capabilities of LLMs in classifying sentence sensitivity, comparing their performance against traditional methods.
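The notion of a "sensitivity shift" in the abstract can be made concrete with a small sketch. The code below is not the authors' method: it assumes each sentence has a sensitivity label before and after paraphrasing, and that the labels sit on an ordinal scale. The label names in `LEVELS` and the function `sensitivity_shift` are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical ordinal sensitivity scale, least to most sensitive.
# The paper's actual class names are not given in this summary.
LEVELS = ["non-sensitive", "mildly-sensitive", "highly-sensitive"]
RANK = {name: i for i, name in enumerate(LEVELS)}


def sensitivity_shift(before, after):
    """Summarize how labels move between original and paraphrased sentences.

    `before` and `after` are equal-length lists of labels from LEVELS,
    aligned so that index i refers to the same sentence.
    """
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    if not before:
        raise ValueError("at least one labelled pair is required")

    # Negative delta: the paraphrase landed in a less sensitive class.
    deltas = [RANK[a] - RANK[b] for b, a in zip(before, after)]
    n = len(deltas)
    return {
        "decreased": sum(d < 0 for d in deltas) / n,
        "unchanged": sum(d == 0 for d in deltas) / n,
        "increased": sum(d > 0 for d in deltas) / n,
        "mean_delta": sum(deltas) / n,
        # Counts of (original label, paraphrased label) pairs.
        "transitions": Counter(zip(before, after)),
    }


if __name__ == "__main__":
    # Toy labels, invented for this example; not data from the paper.
    original = ["highly-sensitive", "highly-sensitive",
                "mildly-sensitive", "non-sensitive"]
    paraphrased = ["mildly-sensitive", "non-sensitive",
                   "mildly-sensitive", "non-sensitive"]
    print(sensitivity_shift(original, paraphrased))
```

A `decreased` share well above the `increased` share, or a negative `mean_delta`, would correspond to the kind of systematic shift toward less sensitive classes that the abstract describes.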
