TL;DR
This paper develops an evaluation framework and anti-jailbreaking system for large language model agents, targeting security threats such as many-shot jailbreaking and deceptive alignment. Detection accuracy is promising, but vulnerabilities persist under long attacks.
Contribution
It introduces a comprehensive security framework combining detection and countermeasures for LLM agents, emphasizing active monitoring and adaptable interventions to enhance robustness.
Findings
94% detection accuracy on Gemini 1.5 Pro
Vulnerabilities persist and grow as attack length increases
Active monitoring improves security resilience
Abstract
Autonomous AI agents built on large language models can create undeniable value across all areas of society, but they face security threats from adversaries that warrant immediate protective solutions because of the trust and safety issues that arise. Many-shot jailbreaking and deceptive alignment are among the main advanced attacks; because they cannot be mitigated by the static guardrails applied during supervised training, they point to a crucial research priority for real-world robustness. Static guardrails in dynamic multi-agent systems fail to defend against those attacks. We aim to enhance security for LLM-based agents by developing new evaluation frameworks that identify and counter threats for safe operational deployment. Our work uses three examination methods to detect rogue agents through a Reverse Turing Test and analyze deceptive alignment…
