Prompt Injection as an Emerging Threat: Evaluating the Resilience of Large Language Models
Daniyal Ganiuly, Assel Smaiyl

TL;DR
This paper introduces a unified framework with three metrics to evaluate the robustness of large language models against prompt injection attacks, revealing that safety tuning enhances resilience more than size.
Contribution
It proposes a comprehensive evaluation framework for prompt injection resilience and demonstrates the importance of safety tuning over model size for robustness.
Findings
GPT-4 shows the highest resilience among models tested
Open-source models are more vulnerable to prompt injection
Safety tuning significantly improves model robustness
Abstract
Large Language Models (LLMs) are increasingly used in intelligent systems that perform reasoning, summarization, and code generation. Their ability to follow natural-language instructions, while powerful, also makes them vulnerable to a new class of attacks known as prompt injection. In these attacks, hidden or malicious instructions are inserted into user inputs or external content, causing the model to ignore its intended task or produce unsafe responses. This study proposes a unified framework for evaluating how resistant Large Language Models (LLMs) are to prompt injection attacks. The framework defines three complementary metrics such as the Resilience Degradation Index (RDI), Safety Compliance Coefficient (SCC), and Instructional Integrity Metric (IIM) to jointly measure robustness, safety, and semantic stability. We evaluated four instruction-tuned models (GPT-4, GPT-4o, LLaMA-3…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Artificial Intelligence in Healthcare and Education
