Guardians of the Agentic System: Preventing Many Shots Jailbreak with Agentic System

Saikat Barua; Mostafizur Rahman; Md Jafor Sadek; Rafiul Islam; Shehenaz Khaled; Ahmedul Kabir

arXiv:2502.16750·cs.CR·June 13, 2025

TL;DR

This paper develops a new evaluation framework and anti-jailbreaking system for large language model agents, addressing security threats like many-shot jailbreaking and deceptive alignment, with promising detection accuracy but persistent vulnerabilities under long attacks.

Contribution

It introduces a comprehensive security framework combining detection and countermeasures for LLM agents, emphasizing active monitoring and adaptable interventions to enhance robustness.

Findings

01 94% detection accuracy on Gemini 1.5 Pro

02 Persistent vulnerabilities increase with attack length

03 Active monitoring improves security resilience

Abstract

Autonomous AI agents built on large language models can create substantial value across society, but they face security threats from adversaries that warrant immediate protective solutions, because trust and safety issues arise. Many-shot jailbreaking and deceptive alignment are among the main advanced attacks; they cannot be mitigated by the static guardrails applied during supervised training, which makes them a crucial research priority for real-world robustness. Static guardrails in a dynamic multi-agent system fail to defend against these attacks. We intend to enhance security for LLM-based agents by developing new evaluation frameworks that identify and counter threats for safe operational deployment. Our work uses three examination methods to detect rogue agents through a Reverse Turing Test and analyze deceptive alignment…

Code & Models

Repositories

GitsSaikat/Guardians-Preventing-Jail-Break-Prompts
none · Official
