Why Do Large Language Models Generate Harmful Content?

Rajesh Ganguli; Raha Moraffah

arXiv:2604.11663·cs.AI·April 14, 2026

TL;DR

This paper uses causal mediation analysis to show that harmful generation in large language models arises mainly in the later layers, is driven primarily by MLP blocks rather than attention blocks, and is gated by specific neurons. It also traces how harmful signals propagate through the model.

Contribution

It introduces a causal analysis framework to pinpoint the layers, modules, and neurons responsible for harmful content generation in LLMs, providing insights into the propagation of harmful signals.

Findings

01  Harmful content mainly arises in later layers of LLMs.

02  Failures in MLP blocks are the primary cause of harmful generation.

03  Specific neurons act as gating mechanisms for harmful content.

Abstract

Large Language Models (LLMs) have been shown to generate harmful content. However, the underlying causes of this behavior remain underexplored. We propose an approach based on causal mediation analysis to identify the causal factors responsible for harmful generation. Our method performs a multi-granular analysis across model layers, modules (MLP and attention blocks), and individual neurons. Extensive experiments on state-of-the-art LLMs indicate that harmful generation arises in the later layers of the model, results primarily from failures in MLP blocks rather than attention blocks, and is associated with neurons that act as a gating mechanism for harmful generation. The results indicate that the early layers of the model are used for a contextual understanding of harmfulness in a prompt, which is then propagated through the model to generate harmfulness in the late layers, as well as…
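
The abstract does not spell out the procedure, so the following is only a minimal sketch of what a causal mediation analysis over layers and modules can look like, not the authors' implementation. It assumes a toy residual network in NumPy where each layer has two mediators labeled "attn" and "mlp" (both are stand-in linear maps with a tanh, so there is no real attention), and a hypothetical `readout` vector standing in for a harmfulness score. All names (`run`, `indirect_effects`, `weights`, `readout`) are illustrative assumptions. The indirect effect of a mediator is estimated by keeping the input fixed and forcing that one mediator to the value it takes on an alternative input.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_LAYERS = 8, 4
MODULES = ("attn", "mlp")

# Toy stand-in weights: one linear map per (layer, module) mediator.
weights = {(l, m): rng.normal(scale=0.5, size=(D, D))
           for l in range(N_LAYERS) for m in MODULES}
# Hypothetical direction that reads a scalar "harmfulness" score off the final state.
readout = rng.normal(size=D)


def run(x, patch=None):
    """Forward pass; `patch` maps (layer, module) -> activation to force in."""
    patch = patch or {}
    h, cache = x.copy(), {}
    for l in range(N_LAYERS):
        for m in MODULES:
            out = np.tanh(weights[(l, m)] @ h)
            out = patch.get((l, m), out)  # intervene on this mediator if requested
            cache[(l, m)] = out
            h = h + out                   # residual update
    return float(readout @ h), cache


def indirect_effects(x_base, x_alt):
    """Score change when one mediator takes its x_alt value while the input stays x_base."""
    base_score, _ = run(x_base)
    _, alt_cache = run(x_alt)
    return {key: run(x_base, {key: act})[0] - base_score
            for key, act in alt_cache.items()}


x_base = rng.normal(size=D)  # stands in for a benign prompt representation
x_alt = rng.normal(size=D)   # stands in for a harmful prompt representation
effects = indirect_effects(x_base, x_alt)
for (layer, module), effect in sorted(effects.items()):
    print(f"layer {layer} {module:4s} indirect effect = {effect:+.3f}")
```

Ranking the mediators by the magnitude of their effect is how one would, in this toy setting, localize the behavior to particular layers and modules. The same patching loop could be narrowed to individual coordinates of a mediator's output to approximate a neuron-level analysis.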
