Enhancing Adversarial Resistance in LLMs with Recursion

Bryan Li; Sounak Bagchi; and Zizhan Wang

arXiv:2412.06181·cs.CR·December 10, 2024

Enhancing Adversarial Resistance in LLMs with Recursion

Bryan Li, Sounak Bagchi, and Zizhan Wang

PDF

Open Access

TL;DR

This paper introduces a recursive prompt simplification framework to improve the adversarial robustness of Large Language Models, aiming to enhance AI safety by making malicious prompts more detectable and preventable.

Contribution

It proposes a novel recursive approach that increases prompt transparency, aiding in the detection and mitigation of adversarial inputs in LLMs.

Findings

01

Improved detection of malicious prompts

02

Enhanced robustness against adversarial attacks

03

Foundation for safer LLM deployment

Abstract

The increasing integration of Large Language Models (LLMs) into society necessitates robust defenses against vulnerabilities from jailbreaking and adversarial prompts. This project proposes a recursive framework for enhancing the resistance of LLMs to manipulation through the use of prompt simplification techniques. By increasing the transparency of complex and confusing adversarial prompts, the proposed method enables more reliable detection and prevention of malicious inputs. Our findings attempt to address a critical problem in AI safety and security, providing a foundation for the development of systems able to distinguish harmless inputs from prompts containing malicious intent. As LLMs continue to be used in diverse applications, the importance of such safeguards will only grow.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSecurity and Verification in Computing · Cryptography and Data Security