Enhancing Adversarial Attacks through Chain of Thought

Jingbo Su

arXiv:2410.21791·cs.CL·October 30, 2024

TL;DR

This paper introduces an adversarial attack on aligned large language models that combines chain of thought (CoT) prompts with the greedy coordinate gradient (GCG) technique, improving the attack's transferability and robustness.

Contribution

It proposes the CoT-GCG approach, integrating chain of thought prompts with gradient-based attacks to enhance adversarial transferability against aligned LLMs.

Findings

1. CoT-GCG outperforms baseline GCG and CoT prompting in attack success.
2. Using CoT triggers stimulates reasoning, improving attack transferability.
3. The approach provides better risk assessment of harmful interactions.

Abstract

Large language models (LLMs) have demonstrated impressive performance across various domains but remain susceptible to safety concerns. Prior research indicates that gradient-based adversarial attacks are particularly effective against aligned LLMs and that chain of thought (CoT) prompting can elicit desired answers through step-by-step reasoning. This paper proposes enhancing the robustness of adversarial attacks on aligned LLMs by integrating CoT prompts with the greedy coordinate gradient (GCG) technique. Using CoT triggers instead of affirmative targets stimulates the reasoning abilities of backend LLMs, thereby improving the transferability and universality of adversarial attacks. We conducted an ablation study comparing our CoT-GCG approach with Amazon Web Services auto-cot. The results show that our approach outperformed both the baseline GCG attack and CoT prompting. Additionally, we…
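The abstract names the greedy coordinate gradient (GCG) technique without describing its search loop. The sketch below illustrates the general shape of one such loop on a toy differentiable objective, not on an LLM and not the paper's implementation. Everything in it is an assumption made for illustration: the random embedding table, the mean-embedding loss, and the names `gcg_step` and `run_toy_search` are hypothetical. In the setting the abstract describes, the loss would instead score how likely the model is to produce a chosen target string, which is an affirmative target in baseline GCG and a CoT trigger in CoT-GCG.

```python
import random


def make_embeddings(vocab_size, dim, rng):
    """Toy embedding table: one random vector per token id (assumption)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def mean_embedding(suffix, emb):
    n = len(suffix)
    return [sum(emb[tok][j] for tok in suffix) / n for j in range(len(emb[0]))]


def loss(suffix, emb, target):
    """Toy objective: squared distance between the suffix's mean embedding and a target vector."""
    return sum((m - t) ** 2 for m, t in zip(mean_embedding(suffix, emb), target))


def token_gradients(suffix, emb, target):
    """Gradient of the toy loss w.r.t. the one-hot coordinate of each vocabulary token.

    The toy loss ignores token order, so the gradient is the same at every position.
    """
    n = len(suffix)
    residual = [m - t for m, t in zip(mean_embedding(suffix, emb), target)]
    return [2.0 / n * sum(r * e for r, e in zip(residual, row)) for row in emb]


def gcg_step(suffix, emb, target, top_k, batch_size, rng):
    """One greedy coordinate step: shortlist tokens by gradient, then test real swaps."""
    grads = token_gradients(suffix, emb, target)
    # Tokens whose one-hot gradient is most negative are the most promising swaps.
    shortlist = sorted(range(len(emb)), key=lambda v: grads[v])[:top_k]
    best, best_loss = suffix, loss(suffix, emb, target)
    for _ in range(batch_size):
        pos = rng.randrange(len(suffix))
        cand = suffix[:pos] + [rng.choice(shortlist)] + suffix[pos + 1:]
        cand_loss = loss(cand, emb, target)  # exact loss, not the linear estimate
        # Simplification: keep a candidate only if it lowers the loss.
        if cand_loss < best_loss:
            best, best_loss = cand, cand_loss
    return best, best_loss


def run_toy_search(steps=50, seed=0):
    rng = random.Random(seed)
    emb = make_embeddings(vocab_size=50, dim=8, rng=rng)
    target = emb[7]  # hypothetical stand-in for the attack objective
    suffix = [rng.randrange(50) for _ in range(6)]
    initial = current = loss(suffix, emb, target)
    for _ in range(steps):
        suffix, current = gcg_step(suffix, emb, target, top_k=8, batch_size=16, rng=rng)
    return initial, current, suffix


if __name__ == "__main__":
    start, end, tokens = run_toy_search()
    print(f"toy loss {start:.4f} -> {end:.4f} with suffix {tokens}")
```

The gradient is used only to shortlist candidate tokens; each candidate swap is then scored with the exact loss, since a gradient over discrete tokens is only a linear approximation.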

Code & Models

Repositories

sujingbo0217/cs222w24-llm-attack
PyTorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Mental Health Research Topics · Advanced Malware Detection Techniques

Methods: LLaMA