Enhancing Adversarial Resistance in LLMs with Recursion
Bryan Li, Sounak Bagchi, and Zizhan Wang

TL;DR
This paper introduces a recursive prompt simplification framework to improve the adversarial robustness of Large Language Models, aiming to enhance AI safety by making malicious prompts more detectable and preventable.
Contribution
It proposes a novel recursive approach that increases prompt transparency, aiding in the detection and mitigation of adversarial inputs in LLMs.
Findings
Improved detection of malicious prompts
Enhanced robustness against adversarial attacks
Foundation for safer LLM deployment
Abstract
The increasing integration of Large Language Models (LLMs) into society necessitates robust defenses against vulnerabilities from jailbreaking and adversarial prompts. This project proposes a recursive framework for enhancing the resistance of LLMs to manipulation through the use of prompt simplification techniques. By increasing the transparency of complex and confusing adversarial prompts, the proposed method enables more reliable detection and prevention of malicious inputs. Our findings attempt to address a critical problem in AI safety and security, providing a foundation for the development of systems able to distinguish harmless inputs from prompts containing malicious intent. As LLMs continue to be used in diverse applications, the importance of such safeguards will only grow.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSecurity and Verification in Computing · Cryptography and Data Security
