Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search

Robert J. Moss

arXiv:2408.08899·cs.CR·August 20, 2024

TL;DR

This paper introduces Kov, a novel method using Markov decision processes and tree search to efficiently generate naturalistic adversarial prompts that can jailbreak black-box LLMs like GPT-3.5, improving safety testing.

Contribution

Kov is the first approach to frame black-box LLM attack generation as an MDP with tree search, optimizing for natural language attacks and demonstrating effectiveness on real models.

Findings

01 Successfully jailbreaks GPT-3.5 in 10 queries

02 Fails to jailbreak GPT-4, which may indicate that newer models are more robust to token-level attacks

03 Includes a naturalistic loss term (log-perplexity) for more interpretable attacks

Abstract

Eliciting harmful behavior from large language models (LLMs) is an important task to ensure the proper alignment and safety of the models. Often when training LLMs, ethical guidelines are followed yet alignment failures may still be uncovered through red teaming adversarial attacks. This work frames the red-teaming problem as a Markov decision process (MDP) and uses Monte Carlo tree search to find harmful behaviors of black-box, closed-source LLMs. We optimize token-level prompt suffixes towards targeted harmful behaviors on white-box LLMs and include a naturalistic loss term, log-perplexity, to generate more natural language attacks for better interpretability. The proposed algorithm, Kov, trains on white-box LLMs to optimize the adversarial attacks and periodically evaluates responses from the black-box LLM to guide the search towards more harmful black-box behaviors. In our…
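The naturalistic term mentioned in the abstract, log-perplexity, can be illustrated with a small sketch. Log-perplexity is the average negative log-likelihood per token, so text that a language model finds fluent scores lower than garbled text. The abstract says only that such a term is included in the objective; the additive form, the `weight` parameter, and the function names below are assumptions made for illustration, not the paper's implementation.

```python
import math
from typing import Sequence


def log_perplexity(token_logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood per token (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def combined_loss(
    target_nll: float,
    token_logprobs: Sequence[float],
    weight: float = 0.1,  # hypothetical trade-off coefficient
) -> float:
    """Task loss plus a weighted naturalness penalty (assumed additive form)."""
    return target_nll + weight * log_perplexity(token_logprobs)


# Toy per-token log-probabilities, as a language model might assign them.
fluent = [math.log(0.5)] * 4    # each token has probability 0.5
garbled = [math.log(0.01)] * 4  # each token has probability 0.01

print(log_perplexity(fluent))   # lower: the text looks natural to the model
print(log_perplexity(garbled))  # higher: the text looks unnatural
```

With the same task loss, the candidate with the lower log-perplexity receives the lower combined loss, which is how a penalty of this kind steers a search toward more readable text.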

Code & Models

Repositories

sisl/kov.jl
pytorch · Official

Taxonomy

Topics: Network Security and Intrusion Detection

Methods: Attention Is All You Need · Cosine Annealing · Adam · Weight Decay · Dense Connections · Byte Pair Encoding · Softmax · Linear Layer