Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context
Nilanjana Das, Edward Raff, Aman Chadha, Manas Gaur

TL;DR
This paper investigates the vulnerabilities of large language models to human-readable, situational context-based adversarial prompts, demonstrating how skilled attackers can bypass safety measures to generate harmful content.
Contribution
It introduces novel attack methods using movie scripts and gibberish transformations, and enhances the AdvPrompter framework with p-nucleus sampling for more effective adversarial prompt generation.
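For readers unfamiliar with p-nucleus (top-p) sampling, the decoding step can be sketched generically as follows. This is a minimal illustration of the standard technique over a toy probability list, not the authors' implementation or AdvPrompter's code; the function names `nucleus_filter` and `nucleus_sample` and the example distribution are assumptions made for illustration only.

```python
import random

def nucleus_filter(probs, p):
    """Keep the smallest set of highest-probability tokens whose
    cumulative probability reaches p, then renormalize them."""
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def nucleus_sample(probs, p, rng=random):
    """Draw one token index from the renormalized nucleus."""
    nucleus = nucleus_filter(probs, p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]

# Hypothetical next-token distribution over a four-token vocabulary.
toy_probs = [0.5, 0.3, 0.15, 0.05]
token = nucleus_sample(toy_probs, p=0.75, rng=random.Random(0))
```

With `p=0.75`, only the two most probable tokens survive the filter, so low-probability tail tokens are never drawn. Larger values of `p` admit more of the tail and make the sampled text more diverse.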
Findings
LLMs can be manipulated with human-readable adversarial prompts.
Enhanced attack techniques significantly improve success rates.
Models like GPT-3.5-Turbo-0125 and Gemma-7b are vulnerable.
Abstract
As AI systems become deeply embedded in social media platforms, we've uncovered a concerning security vulnerability that goes beyond traditional adversarial attacks. It is important to assess the risks of LLMs before the general public uses them on social media platforms, to avoid any adverse impacts. Unlike obviously nonsensical text strings that safety systems can easily catch, our work reveals that human-readable, situation-driven adversarial full-prompts that leverage situational context are effective but much harder to detect. We found that skilled attackers can exploit the vulnerabilities in open-source and proprietary LLMs to make a malicious user query appear safe to LLMs, resulting in a harmful response. This raises an important question about the vulnerabilities of LLMs. To measure the robustness against human-readable attacks, which now present a potent threat, our…
Taxonomy
Topics: Adversarial Robustness in Machine Learning
Methods: Cosine Annealing · Weight Decay · Linear Warmup With Cosine Annealing · Linear Layer · Dropout · Softmax · Attention Is All You Need · Attention Dropout
