Evaluating Adversarial Vulnerabilities in Modern Large Language Models

Tom Perel

arXiv:2511.17666 · cs.CR · November 25, 2025

TL;DR

This study compares how vulnerable two leading large language models, Gemini 2.5 Flash and GPT-4 (specifically the GPT-4o mini model), are to jailbreak attacks. It finds Gemini 2.5 Flash the more susceptible of the two and highlights the challenges of ensuring LLM safety.

Contribution

The paper introduces a scalable framework for automated AI red-teaming and provides empirical insights into LLM vulnerabilities and safety measures.

Findings

01

Gemini 2.5 Flash is more susceptible to jailbreaks than GPT-4.

02

Cross-bypass attacks are highly effective in exploiting vulnerabilities.

03

Vulnerabilities are prevalent across different attack methods and content categories.

Abstract

The recent boom and rapid integration of Large Language Models (LLMs) into a wide range of applications warrant a deeper understanding of their security and safety vulnerabilities. This paper presents a comparative analysis of the susceptibility to jailbreak attacks of two leading publicly available LLMs, Google's Gemini 2.5 Flash and OpenAI's GPT-4 (specifically the GPT-4o mini model accessible in the free tier). The research used two main bypass strategies: 'self-bypass', where models were prompted to circumvent their own safety protocols, and 'cross-bypass', where one model generated adversarial prompts to exploit vulnerabilities in the other. Four attack methods were employed (direct injection, role-playing, context manipulation, and obfuscation) to generate five distinct categories of unsafe content: hate speech, illegal activities, malicious code, dangerous content, and…
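
The experimental design described in the abstract can be pictured as a grid of evaluation cases. The following is a minimal sketch of how such a grid might be enumerated, not the paper's framework: the names `EvalCase` and `build_grid` are hypothetical, the model labels are plain strings rather than API identifiers, and no prompts or model calls are involved. The abstract is truncated before the fifth content category, so only the four visible categories are listed.

```python
from dataclasses import dataclass
from itertools import product

# Labels taken from the abstract; these are plain strings, not API model IDs.
MODELS = ("Gemini 2.5 Flash", "GPT-4o mini")
STRATEGIES = ("self-bypass", "cross-bypass")
METHODS = ("direct injection", "role-playing", "context manipulation", "obfuscation")
# The abstract names five categories but is cut off after the fourth;
# the fifth is deliberately left out rather than guessed.
CATEGORIES = ("hate speech", "illegal activities", "malicious code", "dangerous content")


@dataclass(frozen=True)
class EvalCase:
    """One cell of the evaluation grid (hypothetical structure)."""
    strategy: str
    attacker: str  # model that writes the adversarial prompt
    target: str    # model whose safety behaviour is being evaluated
    method: str
    category: str


def build_grid():
    """Enumerate every strategy x target x method x category combination."""
    cases = []
    for strategy, target, method, category in product(
        STRATEGIES, MODELS, METHODS, CATEGORIES
    ):
        if strategy == "self-bypass":
            # Self-bypass: the target model is prompted against its own safeguards.
            attacker = target
        else:
            # Cross-bypass: the other model generates the adversarial prompt.
            attacker = next(m for m in MODELS if m != target)
        cases.append(EvalCase(strategy, attacker, target, method, category))
    return cases


if __name__ == "__main__":
    grid = build_grid()
    print(f"{len(grid)} evaluation cases")
    print(grid[0])
```

With the four visible categories this yields 2 strategies × 2 targets × 4 methods × 4 categories; the count would grow once the fifth category is included. Keeping attacker and target as separate fields lets self-bypass and cross-bypass share one record type.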

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Hate Speech and Cyberbullying Detection