Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement
Amirhossein Farzam, Majid Behabahani, Mani Malek, Yuriy Nevmyvaka, Guillermo Sapiro

TL;DR
This paper presents a self-supervised framework for detecting jailbreak prompts whose malicious goal is concealed by their framing. It disentangles semantic factors (goal and framing) in LLM activations, which makes detection more robust to framing manipulations.
Contribution
It introduces ReDAct, a representation disentanglement module, and FrameShield, an anomaly detector, to identify concealed jailbreak prompts across multiple LLMs.
Findings
ReDAct effectively disentangles goal and framing signals in LLM activations.
FrameShield improves detection accuracy with minimal computational cost.
Disentanglement reveals distinct goal and framing profiles, aiding interpretability.
Abstract
Large language models (LLMs) remain vulnerable to jailbreak prompts that are fluent and semantically coherent, and therefore difficult to detect with standard heuristics. A particularly challenging failure mode occurs when an attacker tries to hide the malicious goal of their request by manipulating its framing to induce compliance. Because these attacks maintain malicious intent through a flexible presentation, defenses that rely on structural artifacts or goal-specific signatures can fail. Motivated by this, we introduce a self-supervised framework for disentangling semantic factor pairs in LLM activations at inference. We instantiate the framework for goal and framing and construct GoalFrameBench, a corpus of prompts with controlled goal and framing variations, which we use to train the Representation Disentanglement on Activations (ReDAct) module to extract disentangled representations…
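The summary describes FrameShield only as an anomaly detector operating on disentangled representations; its actual design is not given here. As a rough illustration of what anomaly detection over such vectors could look like, the sketch below fits a generic Mahalanobis-distance scorer on synthetic data. Everything in it is an assumption made for illustration: the `MahalanobisDetector` class, the 8-dimensional random vectors standing in for framing representations, and the 99% quantile threshold. It is a swapped-in textbook technique, not the paper's method.

```python
import numpy as np


class MahalanobisDetector:
    """Generic anomaly scorer (hypothetical stand-in, not the paper's FrameShield)."""

    def fit(self, benign: np.ndarray, ridge: float = 1e-3) -> "MahalanobisDetector":
        # benign: (n_samples, dim) array of assumed "framing" representations.
        self.mean_ = benign.mean(axis=0)
        # A small ridge term keeps the covariance matrix invertible.
        cov = np.cov(benign, rowvar=False) + ridge * np.eye(benign.shape[1])
        self.precision_ = np.linalg.inv(cov)
        return self

    def score(self, x: np.ndarray) -> np.ndarray:
        # Squared Mahalanobis distance of each row from the benign mean.
        diff = np.atleast_2d(x) - self.mean_
        return np.einsum("ij,jk,ik->i", diff, self.precision_, diff)


rng = np.random.default_rng(0)
# Synthetic data only: these are not real LLM activations.
benign = rng.normal(0.0, 1.0, size=(500, 8))
shifted = rng.normal(5.0, 1.0, size=(5, 8))

detector = MahalanobisDetector().fit(benign)
# Assumed rule: flag anything scoring above the 99th percentile of benign scores.
threshold = np.quantile(detector.score(benign), 0.99)
flags = detector.score(shifted) > threshold
```

In this toy setup `flags` marks the shifted vectors as anomalous because they lie far from the benign distribution. In a real pipeline the inputs would be representations extracted from model activations, and the threshold would be tuned on held-out data.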
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Advanced Malware Detection Techniques
