Understanding LLM Behavior When Encountering User-Supplied Harmful Content in Harmless Tasks

Junjie Chu; Yiting Qu; Ye Leng; Michael Backes; Yun Shen; Savvas Zannettou; Yang Zhang

arXiv:2603.11914·cs.CR·March 13, 2026

Open Access

TL;DR

This paper investigates whether large language models (LLMs) refuse to proceed when a benign task involves harmful user-supplied content, and finds that current safety measures leave a significant content-level ethical vulnerability.

Contribution

The study systematically evaluates LLM responses to harmful content in benign tasks, highlighting a previously overlooked content-level ethical risk and providing insights for improved safety protocols.

Findings

01

Current LLMs often process harmful content despite safety measures.

02

Harmful knowledge categories like Violence and Graphic content are more likely to elicit harmful responses.

03

Even the latest models like GPT-5.2 and Gemini-3-Pro fail to consistently refuse harmful content.

Abstract

Large Language Models (LLMs) are increasingly trained to align with human values, with a primary focus on the task level, i.e., refusing to execute directly harmful tasks. However, a subtle yet crucial content-level ethical question is often overlooked: when performing a seemingly benign task, will LLMs -- like morally conscious human beings -- refuse to proceed when encountering harmful content in user-provided material? In this study, we aim to understand this content-level ethical question and systematically evaluate its implications for mainstream LLMs. We first construct a harmful knowledge dataset (i.e., non-compliant with OpenAI's usage policy) to serve as the user-supplied harmful content, with 1,357 entries across ten harmful categories. We then design nine harmless tasks (i.e., compliant with OpenAI's usage policy) to simulate real-world benign tasks, grouped into three…

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI