AlignTree: Efficient Defense Against LLM Jailbreak Attacks
Gil Goren, Shahar Katz, Lior Wolf

TL;DR
AlignTree is a computationally efficient defense against LLM jailbreak attacks that monitors model activations during generation and uses a random forest classifier to detect misaligned behavior and prevent harmful content generation.
Contribution
We introduce AlignTree, a lightweight, activation-based defense that enhances LLM safety without significant computational costs or auxiliary prompts.
Findings
AlignTree detects misaligned prompts with high accuracy.
The method maintains low computational overhead across various LLMs.
AlignTree demonstrates robustness against diverse jailbreak benchmarks.
Abstract
Large Language Models (LLMs) are vulnerable to adversarial attacks that bypass safety guidelines and generate harmful content. Mitigating these vulnerabilities requires defense mechanisms that are both robust and computationally efficient. However, existing approaches either incur high computational costs or rely on lightweight defenses that can be easily circumvented, rendering them impractical for real-world LLM-based systems. In this work, we introduce the AlignTree defense, which enhances model alignment while maintaining minimal computational overhead. AlignTree monitors LLM activations during generation and detects misaligned behavior using an efficient random forest classifier. This classifier operates on two signals: (i) the refusal direction -- a linear representation that activates on misaligned prompts, and (ii) an SVM-based signal that captures non-linear features associated…
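To make signal (i) concrete, here is a minimal sketch of how a linear "refusal direction" could be used as a monitoring feature. It is not the authors' implementation. It assumes the direction is estimated as the normalized difference between the mean activation on harmful prompts and the mean activation on harmless prompts, a common construction that the abstract does not confirm. The activations are synthetic Gaussian vectors rather than real LLM hidden states, and every name (`refusal_direction`, `project`, `is_flagged`, `make_activations`) is hypothetical. The SVM-based signal and the random forest that combines both signals are omitted; a simple midpoint threshold stands in for the classifier.

```python
import math
import random


def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Assumed construction: unit-norm difference of class means."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def project(activation, direction):
    """Scalar projection of one activation vector onto the direction."""
    return sum(a * d for a, d in zip(activation, direction))


def make_activations(rng, n, dim, shift):
    """Synthetic stand-in for hidden states: unit Gaussian noise,
    with coordinate 0 shifted by `shift` to mimic a linear feature."""
    return [
        [rng.gauss(0.0, 1.0) + (shift if j == 0 else 0.0) for j in range(dim)]
        for _ in range(n)
    ]


rng = random.Random(0)
DIM = 16  # toy dimensionality, far smaller than a real hidden size

harmful = make_activations(rng, 200, DIM, shift=4.0)
harmless = make_activations(rng, 200, DIM, shift=0.0)

direction = refusal_direction(harmful, harmless)
harmful_scores = [project(a, direction) for a in harmful]
harmless_scores = [project(a, direction) for a in harmless]

# Midpoint threshold between the two class means; a stand-in for the
# learned classifier described in the abstract.
threshold = (
    sum(harmful_scores) / len(harmful_scores)
    + sum(harmless_scores) / len(harmless_scores)
) / 2


def is_flagged(activation):
    """Flag an activation whose projection exceeds the threshold."""
    return project(activation, direction) > threshold


# Evaluate on fresh synthetic samples drawn from the same distributions.
test_harmful = make_activations(rng, 100, DIM, shift=4.0)
test_harmless = make_activations(rng, 100, DIM, shift=0.0)
correct = sum(is_flagged(a) for a in test_harmful)
correct += sum(not is_flagged(a) for a in test_harmless)
accuracy = correct / (len(test_harmful) + len(test_harmless))
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

In a setup closer to the one the abstract describes, the projection would be one input feature among others, and a trained classifier would replace the fixed threshold.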
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Advanced Malware Detection Techniques
