UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models

Huawei Lin; Yingjie Lao; Tong Geng; Tan Yu; Weijie Zhao

arXiv:2502.13141 · cs.CL · February 19, 2025

TL;DR

UniGuardian is a unified defense that detects prompt injection, backdoor, and adversarial attacks on large language models, which the authors group under the term Prompt Trigger Attacks. It runs detection alongside text generation in a single forward pass.

Contribution

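This paper introduces UniGuardian, which the authors describe as the first unified detection method for prompt injection, backdoor, and adversarial attacks in LLMs. It also introduces a single-forward strategy that performs attack detection and text generation in the same forward pass.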

Findings

1. Accurately detects malicious prompts in LLMs
2. Efficiently identifies multiple attack types simultaneously
3. Optimizes detection with a single-forward strategy

Abstract

Large Language Models (LLMs) are vulnerable to attacks like prompt injection, backdoor attacks, and adversarial attacks, which manipulate prompts or models to generate harmful outputs. In this paper, departing from traditional deep learning attack paradigms, we explore their intrinsic relationship and collectively term them Prompt Trigger Attacks (PTA). This raises a key question: Can we determine if a prompt is benign or poisoned? To address this, we propose UniGuardian, the first unified defense mechanism designed to detect prompt injection, backdoor attacks, and adversarial attacks in LLMs. Additionally, we introduce a single-forward strategy to optimize the detection pipeline, enabling simultaneous attack detection and text generation within a single forward pass. Our experiments confirm that UniGuardian accurately and efficiently identifies malicious prompts in LLMs.

Code & Models

Repositories

huawei-lin/uniguardian (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Natural Language Processing Techniques