Large Language Model Sentinel: LLM Agent for Adversarial Purification

Guang Lin; Toshihisa Tanaka; Qibin Zhao

arXiv:2405.20770 · cs.CL · April 24, 2025

Open Access

TL;DR

This paper introduces LLAMOS, a novel adversarial purification method for large language models that uses a defense agent to modify textual inputs minimally, significantly improving robustness against adversarial attacks without requiring adversarial training.

Contribution

The paper presents LLAMOS, a new defense technique employing an agent-based approach to purify adversarial inputs, enhancing LLM robustness without learning from adversarial examples.

Findings

01. LLAMOS effectively defends against adversarial attacks on various LLMs.

02. The defense agent maintains high accuracy with minimal input modifications.

03. When adversarial agents are set against each other in mutual confrontation, neither fully overpowers the other.

Abstract

Over the past two years, the use of large language models (LLMs) has advanced rapidly. While these LLMs offer considerable convenience, they also raise security concerns, as LLMs are vulnerable to adversarial attacks built from well-designed textual perturbations. In this paper, we introduce a novel defense technique named Large LAnguage MOdel Sentinel (LLAMOS), which is designed to enhance the adversarial robustness of LLMs by purifying adversarial textual examples before feeding them into the target LLM. Our method comprises two main components: a) Agent instruction, which can simulate a new agent for adversarial defense, altering minimal characters to maintain the original meaning of the sentence while defending against attacks; b) Defense guidance, which provides strategies for modifying clean or adversarial examples to ensure effective defense and accurate outputs from the target…
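The purify-then-predict flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable interface, the prompt strings `AGENT_INSTRUCTION` and `DEFENSE_GUIDANCE`, and the toy stand-in models are all hypothetical and are not taken from the paper.

```python
from typing import Callable

# Hypothetical interface: any function that maps a prompt string to a text reply.
LLM = Callable[[str], str]

# Illustrative prompt text only; the paper's actual prompts are not reproduced here.
AGENT_INSTRUCTION = (
    "You are a defense agent. Rewrite the text so that it keeps its original "
    "meaning, changing as few characters as possible."
)
DEFENSE_GUIDANCE = (
    "The text may be clean or adversarially perturbed. Repair suspicious typos, "
    "character swaps, or unnatural substitutions; leave clean text unchanged."
)


def build_purification_prompt(text: str) -> str:
    """Combine the agent instruction, the defense guidance, and the raw input."""
    return f"{AGENT_INSTRUCTION}\n{DEFENSE_GUIDANCE}\nInput: {text}\nPurified:"


def purify_then_predict(text: str, defense_llm: LLM, target_llm: LLM) -> str:
    """Purify the input with the defense agent, then query the target model."""
    purified = defense_llm(build_purification_prompt(text)).strip()
    if not purified:
        # Fall back to the original input if the defense agent returns nothing.
        purified = text
    return target_llm(purified)


# Toy stand-ins so the sketch runs without any real model.
def toy_defense(prompt: str) -> str:
    """Pretend defense agent: extracts the input and undoes one known typo."""
    text = prompt.split("Input: ", 1)[1].rsplit("\nPurified:", 1)[0]
    return text.replace("grea7", "great")


def toy_target(text: str) -> str:
    """Pretend classifier that is fooled by a single-character perturbation."""
    return "positive" if "great" in text.split() else "negative"


if __name__ == "__main__":
    attacked = "this movie is grea7"
    print(toy_target(attacked))                                    # no defense
    print(purify_then_predict(attacked, toy_defense, toy_target))  # with defense
```

In this sketch the defense step sits entirely in front of the target model, so the target needs no retraining, which mirrors the claim that the approach works without adversarial training. A real deployment would replace both toy functions with calls to actual models.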

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Natural Language Processing Techniques · Topic Modeling