TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models

Zhi Xu; Jiaqi Li; Xiaotong Zhang; Hong Yu; Han Liu

arXiv:2603.03081·cs.CL·March 4, 2026

Open Access · 3 Reviews

TL;DR

TAO-Attack introduces a novel two-stage optimization approach with a direction-priority token strategy to enhance the effectiveness and efficiency of jailbreak attacks on large language models, achieving near-perfect success rates.

Contribution

It presents a new optimization-based jailbreak method with a two-stage loss function and a token optimization strategy, significantly improving attack success and efficiency over existing methods.

Findings

01

Achieves higher attack success rates, up to 100%.

02

Outperforms state-of-the-art jailbreak methods.

03

Demonstrates effectiveness across multiple LLMs.

Abstract

Large language models (LLMs) have achieved remarkable success across diverse applications but remain vulnerable to jailbreak attacks, where attackers craft prompts that bypass safety alignment and elicit unsafe responses. Among existing approaches, optimization-based attacks have shown strong effectiveness, yet current methods often suffer from frequent refusals, pseudo-harmful outputs, and inefficient token-level updates. In this work, we propose TAO-Attack, a new optimization-based jailbreak method. TAO-Attack employs a two-stage loss function: the first stage suppresses refusals to ensure the model continues harmful prefixes, while the second stage penalizes pseudo-harmful outputs and encourages the model toward more harmful completions. In addition, we design a direction-priority token optimization (DPTO) strategy that improves efficiency by aligning candidates with the gradient…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 5

Strengths

- The paper is clearly written and motivates the proposed approach well.
- The paper presents detailed evaluations on multiple LLMs.
- The paper proposes TAO-Attack, a novel optimization-based jailbreak method for LLMs that enhances both effectiveness and efficiency.
- Experiments across multiple LLMs show that TAO-Attack surpasses previous jailbreak methods.

Weaknesses

- Including experiments on more datasets would further strengthen the empirical validation and generalizability of the proposed method.
- Expanding the jailbreak evaluation to additional models (for instance, the Qwen series) could provide deeper insights into the model-specific robustness and transferability of the approach.
- While the current study focuses on jailbreaking LLMs in harmful text generation, it would be valuable to discuss the broader applicability of the proposed techniques to

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The two-stage design is well-motivated and can produce harmful responses instead of pseudo-harmful ones.
2. I like the analysis of GCG to reveal the need for DPTO.
3. Experiments show superior performance.

Weaknesses

1. The paper lacks an in-depth analysis of why Stage Two actually encourages harmful responses. In other words, how can we ensure that the suppressed $x_O$ is neither harmful nor desirable? If $x_O$ is already harmful, this stage will move the optimization away from success.
2. Only one dataset, AdvBench, was used. It would be better to add more datasets, for example, HarmBench.
3. Similarly, only one fixed suffix was evaluated. It would be better to add more suffixes.
4. Only the used iter

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- Introducing an explicit refusal-suppression loss is conceptually novel and empirically reduces refusal-style outputs.
- DPTO improves optimization stability and convergence speed, and its rationale is theoretically discussed.

Weaknesses

1. Limited coverage of negative samples: The refusal set $R$ is manually curated and may not generalize across models with different refusal behaviors. An automated or semantically driven expansion mechanism would strengthen robustness.
2. Unification and trigger design of the two-stage framework: The two losses share an almost identical structure (each maximizes $x_T$ while penalizing negative examples), with differences only in negative-sample source ($r_j$ vs. $x_O$) and a *Rouge-L*-base

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Artificial Intelligence in Healthcare and Education