When Harmless Words Harm: A New Threat to LLM Safety via Conceptual Triggers

Zhaoxin Zhang; Borui Chen; Yiming Hu; Youyang Qu; Tianqing Zhu; Longxiang Gao

arXiv:2511.21718 · cs.CL · December 1, 2025

Open Access

TL;DR

This paper introduces MICM, a novel method exploiting conceptual triggers to bypass safety filters in large language models, revealing a new vulnerability in LLM alignment and safety mechanisms.

Contribution

The paper presents MICM, a model-agnostic jailbreak technique using conceptual morphology to subtly manipulate LLM outputs without triggering safety filters.

Findings

01 MICM achieves high success rates across multiple advanced LLMs.

02 MICM outperforms existing jailbreak techniques in effectiveness.

03 Safety mechanisms in commercial LLMs are vulnerable to covert value manipulation.

Abstract

Recent research on large language model (LLM) jailbreaks has primarily focused on techniques that bypass safety mechanisms to elicit overtly harmful outputs. However, such efforts often overlook attacks that exploit the model's capacity for abstract generalization, creating a critical blind spot in current alignment strategies. This gap enables adversaries to induce objectionable content by subtly manipulating the implicit social values embedded in model outputs. In this paper, we introduce MICM, a novel, model-agnostic jailbreak method that targets the aggregate value structure reflected in LLM responses. Drawing on conceptual morphology theory, MICM encodes specific configurations of nuanced concepts into a fixed prompt template through a predefined set of phrases. These phrases act as conceptual triggers, steering model outputs toward a specific value stance without triggering…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection