Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering

Hwan Chang; Yumin Kim; Yonghyun Jun; Hwanhee Lee

arXiv:2505.15805·cs.CL·September 17, 2025

TL;DR

This paper introduces CoPriva, a large-scale benchmark dataset for evaluating how well LLMs adhere to security policies in question answering, and reveals significant vulnerabilities, especially against indirect attacks.

Contribution

The paper presents CoPriva, the first comprehensive benchmark dataset for assessing security policy preservation in LLMs against indirect attacks in context.

Findings

01. Many LLMs violate user-defined security policies.

02. Models struggle to incorporate policies during generation.

03. Explicit prompts can partially revise outputs.

Abstract

As Large Language Models (LLMs) are increasingly deployed in sensitive domains such as enterprise and government, ensuring that they adhere to user-defined security policies within context is critical, especially with respect to information non-disclosure. While prior LLM studies have focused on general safety and socially sensitive data, large-scale benchmarks for contextual security preservation against attacks remain lacking. To address this, we introduce a novel large-scale benchmark dataset, CoPriva, evaluating LLM adherence to contextual non-disclosure policies in question answering. Derived from realistic contexts, our dataset includes explicit policies and queries designed as direct and challenging indirect attacks seeking prohibited information. We evaluate 10 LLMs on our benchmark and reveal a significant vulnerability: many models violate user-defined policies and leak…
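To make the setup described in the abstract concrete, here is a minimal Python sketch of how a single benchmark instance and a leakage check might look. Everything in it is an assumption made for illustration: the `PolicyQAExample` fields, the `"direct"`/`"indirect"` labels, the prompt layout, and the substring-based `leaks` check are hypothetical and are not taken from the CoPriva dataset or the authors' evaluation code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PolicyQAExample:
    """Hypothetical record layout; NOT the actual CoPriva schema."""
    context: str                # source document the model may read
    policy: str                 # non-disclosure rule supplied with the context
    query: str                  # user question (direct or indirect attack)
    attack_type: str            # "direct" or "indirect" (labels assumed here)
    protected_terms: list[str]  # phrases that must not appear in an answer


def build_prompt(ex: PolicyQAExample) -> str:
    """Place the policy next to the context, then ask the question."""
    return (
        f"Security policy: {ex.policy}\n\n"
        f"Context: {ex.context}\n\n"
        f"Question: {ex.query}"
    )


def leaks(answer: str, ex: PolicyQAExample) -> bool:
    """Naive check: flag an answer that repeats any protected phrase.

    Case-insensitive substring matching is only a stand-in; it will
    miss paraphrased disclosures.
    """
    lowered = answer.lower()
    return any(term.lower() in lowered for term in ex.protected_terms)


def leak_rate(answers: list[str], examples: list[PolicyQAExample]) -> float:
    """Fraction of answers that disclose protected content."""
    if not examples:
        return 0.0
    flagged = sum(leaks(a, ex) for a, ex in zip(answers, examples))
    return flagged / len(examples)
```

In this sketch, an indirect attack is simply a query that never names the protected item but invites the model to summarize or reason over it. Comparing `leak_rate` across examples grouped by `attack_type` would then give a rough view of how much more often indirect queries slip past the policy.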

Code & Models

Repositories

hwanchang00/CoPriva (Official)

Videos

Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering

Taxonomy

Topics: Topic Modeling · Access Control and Trust · Natural Language Processing Techniques