Large Language Model Sentinel: LLM Agent for Adversarial Purification
Guang Lin, Toshihisa Tanaka, Qibin Zhao

TL;DR
This paper introduces LLAMOS, an adversarial purification method for large language models in which a defense agent makes minimal edits to textual inputs before they reach the target model, significantly improving robustness against adversarial attacks without requiring adversarial training.
Contribution
The paper presents LLAMOS, a new defense technique employing an agent-based approach to purify adversarial inputs, enhancing LLM robustness without learning from adversarial examples.
Findings
LLAMOS effectively defends against adversarial attacks on various LLMs.
The defense agent maintains high accuracy with minimal input modifications.
When the defense agent and an adversarial agent confront each other, neither fully overpowers the other.
Abstract
Over the past two years, the use of large language models (LLMs) has advanced rapidly. While these LLMs offer considerable convenience, they also raise security concerns, as LLMs are vulnerable to adversarial attacks by some well-designed textual perturbations. In this paper, we introduce a novel defense technique named Large LAnguage MOdel Sentinel (LLAMOS), which is designed to enhance the adversarial robustness of LLMs by purifying the adversarial textual examples before feeding them into the target LLM. Our method comprises two main components: a) Agent instruction, which can simulate a new agent for adversarial defense, altering minimal characters to maintain the original meaning of the sentence while defending against attacks; b) Defense guidance, which provides strategies for modifying clean or adversarial examples to ensure effective defense and accurate outputs from the target…
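The purify-then-predict flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt strings, the function names, and the toy stand-in models are all hypothetical, and the edit-distance check is added here only to make "altering minimal characters" measurable.

```python
from typing import Callable, Tuple

# Hypothetical prompt text; the paper's actual agent instruction and
# defense guidance are not reproduced in this summary.
AGENT_INSTRUCTION = (
    "You are a defense agent. Change as few characters as possible, "
    "keep the meaning of the sentence, and remove adversarial perturbations."
)
DEFENSE_GUIDANCE = "Fix typos and odd symbols; leave clean text unchanged."

LLM = Callable[[str], str]  # any text-in, text-out model call (assumption)


def build_defense_prompt(text: str) -> str:
    """Combine the two components into one prompt for the defense agent."""
    return f"{AGENT_INSTRUCTION}\n{DEFENSE_GUIDANCE}\nInput: {text}\nPurified:"


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used to measure how much the purifier changed."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def purify_then_predict(text: str, defense_llm: LLM, target_llm: LLM) -> Tuple[str, str, int]:
    """Purify the input first, then pass only the purified text to the target."""
    purified = defense_llm(build_defense_prompt(text)).strip()
    return purified, target_llm(purified), edit_distance(text, purified)


# Toy stand-ins so the sketch runs without any model API (purely illustrative).
def toy_defense(prompt: str) -> str:
    text = prompt.split("Input: ", 1)[1].rsplit("\nPurified:", 1)[0]
    return text.replace("0", "o")  # undo one known character substitution


def toy_target(text: str) -> str:
    return "positive" if "good" in text else "negative"


result = purify_then_predict("a g0od movie", toy_defense, toy_target)
```

In this toy run the unpurified input would be labeled "negative" by `toy_target`, while the purified input differs by a single character and is labeled "positive". With real models, `defense_llm` and `target_llm` would wrap whatever chat or completion call is available.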
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Natural Language Processing Techniques · Topic Modeling
