SecAlign: Defending Against Prompt Injection with Preference Optimization

Sizhe Chen; Arman Zharmagambetov; Saeed Mahloujifar; Kamalika Chaudhuri; David Wagner; Chuan Guo

arXiv:2410.05451·cs.CR·July 4, 2025

SecAlign: Defending Against Prompt Injection with Preference Optimization

Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, Chuan Guo

PDF

Open Access 1 Repo

TL;DR

SecAlign is a novel defense mechanism that uses preference optimization to significantly reduce prompt injection success rates in large language models, maintaining utility while enhancing security against sophisticated attacks.

Contribution

We introduce SecAlign, the first method employing preference optimization to defend against prompt injection in LLMs, achieving less than 10% success rate against various attacks.

Findings

01

Reduces prompt injection success to <10%

02

Maintains similar utility to original models

03

Generalizes well to unseen attacks

Abstract

Large language models (LLMs) are becoming increasingly prevalent in modern software systems, interfacing between the user and the Internet to assist with tasks that require advanced language understanding. To accomplish these tasks, the LLM often uses external data sources such as user documents, web retrieval, results from API calls, etc. This opens up new avenues for attackers to manipulate the LLM via prompt injection. Adversarial prompts can be injected into external data sources to override the system's intended instruction and instead execute a malicious instruction. To mitigate this vulnerability, we propose a new defense called SecAlign based on the technique of preference optimization. Our defense first constructs a preference dataset with prompt-injected inputs, secure outputs (ones that respond to the legitimate instruction), and insecure outputs (ones that respond to the…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

facebookresearch/secalign
jaxOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsNatural Language Processing Techniques