F2A: An Innovative Approach for Prompt Injection by Utilizing Feign Security Detection Agents
Yupeng Ren

TL;DR
This paper introduces F2A, a novel attack exploiting LLMs' blind trust in safety detection agents, demonstrating how malicious fake results can hijack conversations and proposing solutions to enhance LLM security.
Contribution
The paper presents the Feign Agent Attack (F2A), revealing a new vulnerability in LLM safety mechanisms and offering strategies to mitigate this security risk.
Findings
F2A can successfully hijack LLM conversations using fake safety results
LLMs tend to trust safety detection outputs without critical evaluation
Proposed solutions improve LLM robustness against F2A attacks
Abstract
With the rapid development of Large Language Models (LLMs), numerous mature applications of LLMs have emerged in the field of content safety detection. However, we have found that LLMs exhibit blind trust in safety detection agents. The general LLMs can be compromised by hackers with this vulnerability. Hence, this paper proposed an attack named Feign Agent Attack (F2A).Through such malicious forgery methods, adding fake safety detection results into the prompt, the defense mechanism of LLMs can be bypassed, thereby obtaining harmful content and hijacking the normal conversation. Continually, a series of experiments were conducted. In these experiments, the hijacking capability of F2A on LLMs was analyzed and demonstrated, exploring the fundamental reasons why LLMs blindly trust safety detection results. The experiments involved various scenarios where fake safety detection results were…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNetwork Security and Intrusion Detection
