A methodological analysis of prompt perturbations and their effect on attack success rates

Tiago Machado; Maysa Malfiza Garcia de Macedo; Rogerio Abreu de Paula; Marcelo Carpinette Grave; Aminat Adebiyi; Luan Soares de Souza; Enrico Santarelli; Claudio Pinhanez

arXiv:2511.10686 · cs.CL · November 17, 2025

TL;DR

This paper systematically analyzes how different LLM alignment methods influence the models' vulnerability to prompt attacks, revealing that small prompt changes can significantly alter attack success rates and highlighting limitations of current benchmark evaluations.

Contribution

It provides a statistical, systematic comparison of alignment methods' impact on attack susceptibility and emphasizes the importance of prompt variation in vulnerability assessment.

Findings

01. Small prompt modifications significantly affect attack success rates.

02. Different alignment methods show varying susceptibility to prompt attacks.

03. Current benchmarks may not fully capture model vulnerabilities.

Abstract

This work investigates how different Large Language Model (LLM) alignment methods affect the models' responses to prompt attacks. We selected open-source models based on the most common alignment methods, namely Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning from Human Feedback (RLHF). We conducted a systematic statistical analysis to verify how sensitive the Attack Success Rate (ASR) is to variations in prompts designed to elicit inappropriate content from LLMs. Our results show that, according to the statistical tests we ran, even small prompt modifications can significantly change the ASR, making the models more or less susceptible to particular types of attack. Critically, our results demonstrate that running existing 'attack benchmarks' alone may not be sufficient to elicit all possible…
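To make the idea of testing ASR sensitivity concrete, the sketch below computes the ASR for two sets of attack outcomes and compares them with a two-sided two-proportion z-test. The abstract does not say which statistical tests the authors used, so the choice of test, the function names, and the outcome data here are all assumptions made for illustration only; they are not results from the paper.

```python
import math


def attack_success_rate(outcomes):
    """Fraction of attack attempts judged successful (True)."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for equality of two proportions (pooled variance).

    Illustrative choice of test; the paper's actual tests are not
    specified in the abstract.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both samples are all-success or all-failure: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical judgments (True = attack succeeded); made-up data.
baseline = [True] * 30 + [False] * 70   # original prompts
perturbed = [True] * 50 + [False] * 50  # slightly modified prompts

asr_baseline = attack_success_rate(baseline)
asr_perturbed = attack_success_rate(perturbed)
z, p = two_proportion_z_test(
    sum(baseline), len(baseline), sum(perturbed), len(perturbed)
)
print(f"ASR baseline={asr_baseline:.2f}, perturbed={asr_perturbed:.2f}")
print(f"z={z:.3f}, p={p:.4f}")
```

With small samples or very low success counts, an exact test such as Fisher's exact test would likely be a safer choice than the normal approximation used here.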

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Information and Cyber Security · Advanced Malware Detection Techniques