Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context

Nilanjana Das; Edward Raff; Aman Chadha; Manas Gaur

arXiv:2412.16359·cs.CL·May 30, 2025

Open Access

TL;DR

This paper investigates the vulnerabilities of large language models to human-readable, situational context-based adversarial prompts, demonstrating how skilled attackers can bypass safety measures to generate harmful content.

Contribution

It introduces novel attack methods using movie scripts and gibberish transformations, and enhances the AdvPrompter framework with p-nucleus sampling for more effective adversarial prompt generation.

Findings

01

LLMs can be manipulated with human-readable adversarial prompts.

02

Enhanced attack techniques significantly improve success rates.

03

Models like GPT-3.5-Turbo-0125 and Gemma-7b are vulnerable.

Abstract

As AI systems become deeply embedded in social media platforms, we've uncovered a concerning security vulnerability that goes beyond traditional adversarial attacks. It is important to assess the risks of LLMs before the general public uses them on social media platforms, to avoid adverse impacts. Unlike obviously nonsensical text strings that safety systems can easily catch, our work reveals that human-readable, situation-driven adversarial full-prompts that leverage situational context are effective but much harder to detect. We found that skilled attackers can exploit the vulnerabilities in open-source and proprietary LLMs to make a malicious user query appear safe to LLMs, resulting in a harmful response. This raises an important question about the vulnerabilities of LLMs. To measure the robustness against human-readable attacks, which now present a potent threat, our…

Taxonomy

Topics: Adversarial Robustness in Machine Learning

Methods: Cosine Annealing · Weight Decay · Linear Warmup With Cosine Annealing · Linear Layer · Dropout · Softmax · Attention Is All You Need · Attention Dropout