You Know What I'm Saying: Jailbreak Attack via Implicit Reference

Tianyu Wu; Lingrui Mei; Ruibin Yuan; Lujun Li; Wei Xue; Yike Guo

arXiv:2410.03857 · cs.CL · October 10, 2024

Open Access · 1 Repo · 3 Reviews

TL;DR

This paper uncovers a vulnerability in large language models, Attack via Implicit Reference (AIR), which splits a malicious objective into harmless-looking objectives linked through implicit references in the context. AIR bypasses existing detection and achieves high success rates across multiple models, highlighting the need for improved defenses.

Contribution

The study introduces AIR, a novel contextual attack method that bypasses existing detection techniques and demonstrates its effectiveness across various state-of-the-art LLMs.

Findings

01

AIR achieves over 90% attack success rate on multiple models.

02

Larger models are more vulnerable to AIR.

03

Cross-model attack strategy increases malicious content generation.

Abstract

While recent advancements in large language model (LLM) alignment have enabled the effective identification of malicious objectives involving scene nesting and keyword rewriting, our study reveals that these methods remain inadequate at detecting malicious objectives expressed through context within nested harmless objectives. This study identifies a previously overlooked vulnerability, which we term Attack via Implicit Reference (AIR). AIR decomposes a malicious objective into permissible objectives and links them through implicit references within the context. This method employs multiple related harmless objectives to generate malicious content without triggering refusal responses, thereby effectively bypassing existing detection techniques. Our experiments demonstrate AIR's effectiveness across state-of-the-art LLMs, achieving an attack success rate (ASR) exceeding 90% on most…
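For reference, an attack success rate such as the one quoted above is conventionally the fraction of evaluated prompts on which an attack is judged successful. The following is a minimal sketch of that arithmetic only, assuming per-prompt boolean judgments are already available. The function name and the sample data are hypothetical, and the paper's actual judging procedure is not described in this summary.

```python
from typing import Sequence


def attack_success_rate(judgments: Sequence[bool]) -> float:
    """Return the fraction of attempts judged successful, in [0.0, 1.0].

    `judgments` is a hypothetical list of per-prompt outcomes; how each
    outcome is judged is outside the scope of this sketch.
    """
    if not judgments:
        raise ValueError("judgments must be non-empty")
    # True counts as 1 and False as 0 when summed.
    return sum(judgments) / len(judgments)


if __name__ == "__main__":
    # Illustrative outcomes for 10 prompts (not data from the paper).
    example = [True] * 9 + [False]
    print(f"ASR = {attack_success_rate(example):.0%}")
```

An ASR "exceeding 90%" therefore means that more than nine in ten evaluated prompts were judged successful under whatever criterion the evaluation used.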

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

1. This paper shows good ASR on SoTA LLMs in Table 1.
2. The authors also show a cross-model attack in Table 2 (as explained in the algorithm).
3. The authors compare their attack with baseline defenses and report the rejection rate in Table 5.

Weaknesses

1. There is little information about the dataset in experimental Section 4.1, such as the number of samples used during the training approach.
2. The method calls an LLM three times: once while rewriting the input prompt, and once in each of the first and second stages of the attack. Three LLM calls for a single prompt are expensive in terms of the time taken to rewrite a prompt. The authors should comment on the inference time of their appr

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The AIR method achieves a high ASR (above 90%) on models with a large number of parameters.
2. The cross-model strategy, wherein a less secure model is targeted first to create malicious content, which is then used to bypass more secure models, emphasizes the potential transferability of vulnerabilities between LLMs.
3. The paper identifies an inverse relationship between model size and security, highlighting that larger models, typically with enhanced in-context learning capabilities, are more

Weaknesses

1. The success of the AIR method appears to heavily rely on models with high comprehension abilities, specifically in areas like nuanced language understanding, contextual retention, and sophisticated in-context learning. These capabilities are essential because AIR requires the model to maintain and link fragmented, implicitly harmful objectives across multiple interactions without overtly identifying them as malicious. Authors should test this with lighter models (models with fewer parameters)

Reviewer 03 · Rating 5 · Confidence 4

Strengths

1. The introduction of "nesting objective generation" is interesting.
2. The paper is easy to follow.

Weaknesses

1. While interesting, as the motivation and basis of your method, Section 2.3 "Nesting Objective Generation" needs more detailed analysis. You could add more analysis about the connection between the attention mechanisms and your proposed concept (nesting objective generation and implicit inferences). For example, you could analyze how the attention mechanism affects the nesting objectives. The existing version makes Section 2.3 more like a guess.
2. This paper needs to be polished in its writi

Code & Models

Repositories

lucas-ty/llm_implicit_reference
Official

Taxonomy

Topics: Deception detection and forensic psychology · Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning