Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement

Amirhossein Farzam; Majid Behabahani; Mani Malek; Yuriy Nevmyvaka; Guillermo Sapiro

arXiv:2602.19396 · cs.AI · February 24, 2026

TL;DR

This paper presents a self-supervised framework for detecting jailbreak prompts that conceal a malicious goal behind manipulated framing. It works by disentangling goal and framing factors in LLM activations, which improves robustness against framing manipulations.

Contribution

It introduces ReDAct, a representation disentanglement module, and FrameShield, an anomaly detector, to identify concealed jailbreak prompts across multiple LLMs.

Findings

01. ReDAct effectively disentangles goal and framing signals in LLM activations.

02. FrameShield improves detection accuracy with minimal computational cost.

03. Disentanglement reveals distinct goal and framing profiles, aiding interpretability.

Abstract

Large language models (LLMs) remain vulnerable to jailbreak prompts that are fluent and semantically coherent, and therefore difficult to detect with standard heuristics. A particularly challenging failure mode occurs when an attacker tries to hide the malicious goal of their request by manipulating its framing to induce compliance. Because these attacks maintain malicious intent through a flexible presentation, defenses that rely on structural artifacts or goal-specific signatures can fail. Motivated by this, we introduce a self-supervised framework for disentangling semantic factor pairs in LLM activations at inference. We instantiate the framework for goal and framing and construct GoalFrameBench, a corpus of prompts with controlled goal and framing variations, which we use to train the Representation Disentanglement on Activations (ReDAct) module to extract disentangled representations…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Advanced Malware Detection Techniques