Evaluating Adversarial Vulnerabilities in Modern Large Language Models
Tom Perel

TL;DR
This study compares how susceptible two leading large language models, Google's Gemini 2.5 Flash and OpenAI's GPT-4 (specifically the free-tier GPT-4o mini model), are to jailbreak attacks. It reports significant differences between the two and highlights how difficult it remains to ensure LLM safety.
Contribution
It introduces a scalable framework for automated AI red-teaming and provides empirical insights into LLM vulnerabilities and safety measures.
Findings
Gemini 2.5 Flash is more susceptible to jailbreaks than GPT-4.
Cross-bypass attacks, in which one model generates adversarial prompts aimed at the other, are highly effective at exploiting vulnerabilities.
Vulnerabilities are prevalent across different attack methods and content categories.
Abstract
The rapid integration of Large Language Models (LLMs) into a wide range of applications warrants a deeper understanding of their security and safety vulnerabilities. This paper presents a comparative analysis of the susceptibility to jailbreak attacks of two leading publicly available LLMs: Google's Gemini 2.5 Flash and OpenAI's GPT-4 (specifically the GPT-4o mini model accessible in the free tier). The research used two main bypass strategies: 'self-bypass', where models were prompted to circumvent their own safety protocols, and 'cross-bypass', where one model generated adversarial prompts to exploit vulnerabilities in the other. Four attack methods (direct injection, role-playing, context manipulation, and obfuscation) were used to generate five distinct categories of unsafe content: hate speech, illegal activities, malicious code, dangerous content, and…
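The experimental design described in the abstract can be read as a grid of trials. The following is a minimal sketch of that grid, not the paper's actual framework: the function names `build_trials` and `success_rate` are hypothetical, the success-rate metric is an assumption about how outcomes might be summarized, and the fifth content category is left out because the abstract is truncated before naming it.

```python
from itertools import product

# Labels taken from the abstract. The fifth content category is omitted
# because the abstract is cut off before naming it.
MODELS = ["Gemini 2.5 Flash", "GPT-4o mini"]
STRATEGIES = ["self-bypass", "cross-bypass"]
METHODS = ["direct injection", "role-playing", "context manipulation", "obfuscation"]
CATEGORIES = ["hate speech", "illegal activities", "malicious code", "dangerous content"]


def build_trials():
    """Enumerate one trial per (target, strategy, method, category) cell.

    Hypothetical helper: the paper's real framework is not shown in the source.
    """
    trials = []
    for target, strategy, method, category in product(
        MODELS, STRATEGIES, METHODS, CATEGORIES
    ):
        if strategy == "self-bypass":
            # The target model is asked to circumvent its own safeguards.
            generator = target
        else:
            # The other model writes the adversarial prompt for the target.
            generator = next(m for m in MODELS if m != target)
        trials.append(
            {
                "target": target,
                "generator": generator,
                "strategy": strategy,
                "method": method,
                "category": category,
            }
        )
    return trials


def success_rate(outcomes):
    """Fraction of trials judged a successful bypass (assumed metric)."""
    outcomes = list(outcomes)
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


if __name__ == "__main__":
    trials = build_trials()
    print(len(trials), "trial cells with the four categories listed above")
```

Grouping the per-trial outcomes by `target`, `strategy`, `method`, or `category` before calling `success_rate` would give the kind of per-model and per-method comparison that the findings summarize.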
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Hate Speech and Cyberbullying Detection
