AlignTree: Efficient Defense Against LLM Jailbreak Attacks

Gil Goren; Shahar Katz; Lior Wolf

arXiv:2511.12217 · cs.LG · November 18, 2025

TL;DR

AlignTree is a novel, computationally efficient defense mechanism against LLM jailbreak attacks that uses activation monitoring and machine learning classifiers to detect and prevent harmful content generation.

Contribution

We introduce AlignTree, a lightweight, activation-based defense that enhances LLM safety without significant computational costs or auxiliary prompts.

Findings

1. AlignTree effectively detects misaligned prompts with high accuracy.
2. The method maintains low computational overhead across various LLMs.
3. AlignTree demonstrates robustness against diverse jailbreak benchmarks.

Abstract

Large Language Models (LLMs) are vulnerable to adversarial attacks that bypass safety guidelines and generate harmful content. Mitigating these vulnerabilities requires defense mechanisms that are both robust and computationally efficient. However, existing approaches either incur high computational costs or rely on lightweight defenses that can be easily circumvented, rendering them impractical for real-world LLM-based systems. In this work, we introduce the AlignTree defense, which enhances model alignment while maintaining minimal computational overhead. AlignTree monitors LLM activations during generation and detects misaligned behavior using an efficient random forest classifier. This classifier operates on two signals: (i) the refusal direction -- a linear representation that activates on misaligned prompts, and (ii) an SVM-based signal that captures non-linear features associated…
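
The first signal described in the abstract, the refusal direction, can be pictured as a projection of a hidden-state vector onto a fixed direction. The block below is a minimal sketch of that idea only. It assumes the direction is estimated as the normalized difference between mean activations on harmful and harmless prompts, which is a common construction but is not confirmed by the abstract. The function names and the tiny three-dimensional activations are hypothetical and synthetic; they are not taken from the paper.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def refusal_direction(harmful_acts, harmless_acts):
    """ASSUMPTION: direction = normalized difference of class means."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    return unit([a - b for a, b in zip(mu_harmful, mu_harmless)])


def projection(activation, direction):
    """Scalar signal: dot product of an activation with the direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Synthetic activations standing in for one layer's hidden states.
harmful = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
harmless = [[0.1, 0.0, 0.1], [-0.1, 0.2, -0.1]]

direction = refusal_direction(harmful, harmless)
score_harmful = projection([1.9, 0.0, 0.1], direction)
score_harmless = projection([0.0, 0.1, 0.0], direction)
```

In the setup the abstract describes, a scalar like this would be one input to the random forest classifier, alongside the SVM-based signal. The sketch stops at the feature itself and does not reproduce the classifier.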

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Advanced Malware Detection Techniques