Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Hadas Orgad; Boyi Wei; Kaden Zheng; Martin Wattenberg; Peter Henderson; Seraphina Goldfarb-Tarrant; Yonatan Belinkov

arXiv:2604.09544 · cs.CL · April 13, 2026

TL;DR

This paper uncovers a compact, internal structure for harmful content generation in large language models, showing that alignment reshapes these harmful representations and that pruning can mitigate emergent misalignment.

Contribution

It demonstrates that harmfulness in LLMs depends on a distinct set of weights, which are reshaped by alignment and can be targeted through pruning to improve safety.

Findings

01. Harmful content generation relies on a compact set of weights consistent across harm types.

02. Aligned models show greater compression of harm-related weights than unaligned models.

03. Pruning harm-related weights reduces emergent misalignment in narrow domains.

Abstract

Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit a greater compression of harm generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally, despite the brittleness of safety guardrails at the surface level. This…
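
To make the idea of targeted weight pruning concrete, the sketch below shows one generic way such an intervention could be set up on a single weight matrix. It is an illustration under stated assumptions, not the authors' procedure: the abstract does not specify how weights are scored or selected, so the first-order saliency score |w · ∂L/∂w|, the function names (`saliency`, `top_fraction_mask`, `prune_targeted`), and the `harm_frac` / `benign_frac` parameters are all hypothetical. The sketch zeroes weights that rank highly for a harmful-generation loss while sparing weights that also rank highly for a benign loss, which mirrors the abstract's claim that harm-related weights are distinct from benign capabilities.

```python
import numpy as np

def saliency(weights, grads):
    """First-order importance score per weight: |w * dL/dw| (an assumed scoring rule)."""
    return np.abs(weights * grads)

def top_fraction_mask(scores, fraction):
    """Boolean mask marking the highest-scoring `fraction` of entries (ties may add extras)."""
    k = max(1, int(round(fraction * scores.size)))
    # The k-th largest score serves as the inclusion threshold.
    threshold = np.partition(scores.ravel(), -k)[-k]
    return scores >= threshold

def prune_targeted(weights, harm_grads, benign_grads, harm_frac=0.05, benign_frac=0.20):
    """Zero weights that matter for the harmful loss but not for the benign loss.

    harm_grads / benign_grads are gradients of a harmful-generation loss and a
    benign-task loss with respect to `weights` (hypothetical inputs).
    """
    harm_mask = top_fraction_mask(saliency(weights, harm_grads), harm_frac)
    benign_mask = top_fraction_mask(saliency(weights, benign_grads), benign_frac)
    prune_mask = harm_mask & ~benign_mask  # protect weights benign tasks rely on
    pruned = np.where(prune_mask, 0.0, weights)
    return pruned, prune_mask

# Toy usage with random stand-ins for real model weights and gradients.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
g_harm = rng.normal(size=W.shape)
g_benign = rng.normal(size=W.shape)
W_pruned, mask = prune_targeted(W, g_harm, g_benign)
print(f"pruned {int(mask.sum())} of {W.size} weights")
```

In a causal-intervention setting, one would then compare the model's behavior on harmful and benign prompts before and after pruning; how small the pruned set can be while still changing harmful behavior is one way to read the "compactness" the abstract describes.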
