What's Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content

Alfio Ferrara; Sergio Picascia; Laura Pinnavaia; Vojimir Ranitovic; Elisabetta Rocchetti; Alice Tuveri

arXiv:2507.23319·cs.CL·August 1, 2025

TL;DR

This paper empirically investigates whether GPT-4o-mini implicitly moderates sensitive content when paraphrasing it, finding that the model reduces derogatory and taboo language without being instructed to. It also evaluates how well LLMs classify sentence sensitivity in a zero-shot setting.

Contribution

It provides a systematic analysis of GPT-4o-mini's implicit content moderation during paraphrasing, together with an evaluation of zero-shot sensitivity classification by LLMs.

Findings

01

GPT-4o-mini reduces derogatory and taboo language in paraphrasing.

02

Paraphrases produced by GPT-4o-mini shift systematically toward less sensitive classes.

03

Zero-shot classification performance is comparable to traditional methods.

Abstract

Proprietary Large Language Models (LLMs) have shown tendencies toward politeness, formality, and implicit content moderation. While previous research has primarily focused on explicitly training models to moderate and detoxify sensitive content, there has been limited exploration of whether LLMs implicitly sanitize language without explicit instructions. This study empirically analyzes the implicit moderation behavior of GPT-4o-mini when paraphrasing sensitive content and evaluates the extent of sensitivity shifts. Our experiments indicate that GPT-4o-mini systematically moderates content toward less sensitive classes, with substantial reductions in derogatory and taboo language. We also evaluate the zero-shot capabilities of LLMs in classifying sentence sensitivity, comparing their performance against traditional methods.
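
One way to picture the "sensitivity shift" described in the abstract is as a tally of class transitions between each original sentence and its paraphrase. The sketch below is illustrative only and is not the authors' code. The ordered label scale in `SCALE`, the function name `shift_summary`, and the example pairs are all assumptions made up for this illustration; the paper's actual class set and metrics may differ.

```python
from collections import Counter

# Hypothetical ordered scale, least to most sensitive.
# This is an assumption for illustration, not the paper's label set.
SCALE = ["neutral", "derogatory", "taboo"]
RANK = {label: i for i, label in enumerate(SCALE)}


def shift_summary(pairs):
    """Summarize class transitions for (original_label, paraphrase_label) pairs."""
    pairs = list(pairs)
    total = len(pairs)
    if total == 0:
        return {"transitions": Counter(), "lowered": 0.0, "raised": 0.0, "unchanged": 0.0}
    # Count how many paraphrases moved down, up, or stayed on the scale.
    lowered = sum(1 for orig, para in pairs if RANK[para] < RANK[orig])
    raised = sum(1 for orig, para in pairs if RANK[para] > RANK[orig])
    return {
        "transitions": Counter(pairs),  # counts per (original, paraphrase) pair
        "lowered": lowered / total,
        "raised": raised / total,
        "unchanged": (total - lowered - raised) / total,
    }


# Made-up labels, only to show the shape of the input.
example_pairs = [
    ("taboo", "derogatory"),
    ("taboo", "neutral"),
    ("derogatory", "neutral"),
    ("neutral", "neutral"),
]
summary = shift_summary(example_pairs)
print(summary["lowered"], summary["raised"], summary["unchanged"])
```

In this framing, a model that implicitly sanitizes text would show a `lowered` share well above its `raised` share; the labels themselves would come from whatever sensitivity classifier or annotation is applied to both versions of each sentence.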
