Misusing Tools in Large Language Models With Visual Adversarial Examples
Xiaohan Fu, Zihan Wang, Shuheng Li, Rajesh K. Gupta, Niloofar Mireshghallah, Taylor Berg-Kirkpatrick, Earlence Fernandes

TL;DR
This paper demonstrates that visual adversarial examples can stealthily manipulate multimodal large language models into invoking tools as the attacker wishes, compromising the confidentiality and integrity of user resources without visibly disrupting the conversation.
Contribution
It introduces a novel attack method using visual adversarial examples to compromise LLM tool usage, highlighting security vulnerabilities in multimodal models.
Findings
Adversarial images cause ~98% tool invocation success.
High similarity (~0.9 SSIM) maintained with clean images.
Attacks do not significantly alter conversation semantics.
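The SSIM figure above measures how close an adversarial image stays to its clean counterpart by comparing luminance, contrast, and structure. As a rough illustration only, the sketch below computes a simplified single-window (global) SSIM over flattened grayscale pixels in [0, 1]. The function name `global_ssim` is hypothetical, and standard implementations average SSIM over local windows, so their values will differ from this simplification. The constants 0.01 and 0.03 are the conventional defaults of the SSIM formulation.

```python
def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM using one window that covers the whole image.

    x, y: equal-length sequences of pixel intensities (flattened grayscale).
    This is an illustrative simplification, not the windowed SSIM used by
    standard image libraries.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("images must be non-empty and the same size")
    n = len(x)
    # Stabilising constants, scaled by the dynamic range of the pixels.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    luminance = (2 * mu_x * mu_y + c1) / (mu_x * mu_x + mu_y * mu_y + c1)
    structure = (2 * cov + c2) / (var_x + var_y + c2)
    return luminance * structure
```

An image compared with itself scores 1, and the score drops as the perturbation grows, which is why a value near 0.9 indicates a perturbed image that still looks much like the original.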
Abstract
Large Language Models (LLMs) are being enhanced with the ability to use tools and to process multiple modalities. These new capabilities bring new benefits and also new security risks. In this work, we show that an attacker can use visual adversarial examples to cause attacker-desired tool usage. For example, the attacker could cause a victim LLM to delete calendar events, leak private conversations and book hotels. Different from prior work, our attacks can affect the confidentiality and integrity of user resources connected to the LLM while being stealthy and generalizable to multiple input prompts. We construct these attacks using gradient-based adversarial training and characterize performance along multiple dimensions. We find that our adversarial images can manipulate the LLM to invoke tools following real-world syntax almost always (~98%) while maintaining high similarity to…
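The idea of building an image perturbation with gradients, and making it work across several prompts, can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a hypothetical linear stand-in for the model (a weight matrix `W` mapping pixels to token logits), represents each user prompt as a hypothetical logit bias, and uses projected sign-gradient descent under an L-infinity budget `eps`, which is one common way to construct bounded perturbations. All names here are illustrative.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, x, bias):
    # Toy "model": one linear layer plus a per-prompt bias (an assumption).
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W, bias)]

def loss_and_grad(W, x, prompt_biases, target):
    """Mean cross-entropy toward `target` over all prompts, and d(loss)/dx."""
    n = len(prompt_biases)
    loss = 0.0
    grad = [0.0] * len(x)
    for bias in prompt_biases:
        p = softmax(logits(W, x, bias))
        loss -= math.log(p[target]) / n
        for k, row in enumerate(W):
            # Gradient of softmax cross-entropy w.r.t. logit k.
            err = p[k] - (1.0 if k == target else 0.0)
            for i, w in enumerate(row):
                grad[i] += err * w / n
    return loss, grad

def pgd_attack(W, image, prompt_biases, target, eps=0.1, alpha=0.01, steps=200):
    """Return a perturbed image within eps (L-infinity) of `image`."""
    delta = [0.0] * len(image)
    for _ in range(steps):
        # Keep the perturbed image a valid image with pixels in [0, 1].
        x = [min(1.0, max(0.0, p + d)) for p, d in zip(image, delta)]
        _, grad = loss_and_grad(W, x, prompt_biases, target)
        for i, g in enumerate(grad):
            step = alpha if g > 0 else -alpha if g < 0 else 0.0
            # Descend on the loss, then project back into the eps-ball.
            delta[i] = min(eps, max(-eps, delta[i] - step))
    return [min(1.0, max(0.0, p + d)) for p, d in zip(image, delta)]
```

Averaging the loss over several prompt biases is what stands in for generalization across prompts here: a single perturbation has to push every prompt toward the same target output. The `eps` bound plays the role of the stealthiness constraint, since it caps how far any pixel may move from the clean image.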
Peer Reviews
Decision: Submitted to ICLR 2024
- In general, this paper is well-structured and easy to follow.
- I believe the problem this paper addresses is highly significant. It focuses on understanding how to attack systems using LLMs in real-world scenarios, which presents new challenges when viewed from a systemic perspective.
- The experimental results demonstrate that malicious instructions can be generated by perturbing the input image.

- This paper assumes that interaction with the tools occurs through an instruction line, followed by normal question answering, as illustrated in Figure 1. Is this setting realistic? What does a real system look like, and how do these VLMs interact with downstream tools like email? Please provide an illustration of why the task in Figure 1 is realistic.
- This paper lacks technical contributions and depth. The technical contribution of this paper is to generate perturbations on the image side th
1. The big picture of the paper is sound. Indeed, as LLMs are integrated into applications, critical resources may be controlled by the models. Attacks on the models can then have broad implications beyond the misalignment of moral values. The threat model and the real-world risk analysis in this paper are quite insightful.
2. The approach is simple and effective.
3. The authors make efforts to collect evaluation datasets and to conduct a comprehensive human evaluation.

1. **Only a single model, LLaMA Adapter, is tested.** This makes the scope of the evaluation look somewhat narrow. I suggest the authors also consider other VLMs such as MiniGPT-4 [1], InstructBLIP [2], and LLaVA [3]. This can make the evaluation more convincing.
2. **Lack of case studies on real LLM-integrated applications.** The paper mentions that LangChain and Guidance facilitate the development of such integrations, but it does not provide a single instance of this to illustrate the
1. The paper studies a new and timely security issue of LLMs.
2. The proposed method achieves better stealthiness than prior work.

1. The proposed method seems straightforward and is essentially a way of injecting a backdoor. It is unclear how this method helps generalize the trigger to other prompts.
2. Since the goal is to trigger the misuse of tools, why are limiting the adversarial perturbation and enhancing generalizability important? The adversary only needs one prompt to trigger the malicious usage of the tools.
3. It is unclear what the implications of these 5 attack objectives are. Does the
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling
