Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering
Hwan Chang, Yumin Kim, Yonghyun Jun, Hwanhee Lee

TL;DR
This paper introduces CoPriva, a large-scale benchmark dataset for evaluating how well LLMs adhere to security policies in question answering. It reveals significant vulnerabilities, especially against indirect attacks.
Contribution
The paper presents CoPriva, the first comprehensive benchmark dataset for assessing security policy preservation in LLMs against indirect attacks in context.
Findings
Many LLMs violate user-defined security policies.
Models struggle to incorporate policies during generation.
Explicit prompts can partially revise outputs.
Abstract
As Large Language Models (LLMs) are increasingly deployed in sensitive domains such as enterprise and government, ensuring that they adhere to user-defined security policies within context is critical, especially with respect to information non-disclosure. While prior LLM studies have focused on general safety and socially sensitive data, large-scale benchmarks for contextual security preservation against attacks remain lacking. To address this, we introduce a novel large-scale benchmark dataset, CoPriva, evaluating LLM adherence to contextual non-disclosure policies in question answering. Derived from realistic contexts, our dataset includes explicit policies and queries designed as direct and challenging indirect attacks seeking prohibited information. We evaluate 10 LLMs on our benchmark and reveal a significant vulnerability: many models violate user-defined policies and leak…
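The abstract describes benchmark items that pair a context with an explicit non-disclosure policy and a direct or indirect query. A minimal Python sketch of how such an item and a crude leakage check might be represented follows. The class name, the field names (`protected_terms`, `attack_type`), the substring test, and the `leakage_rate` helper are all illustrative assumptions, not CoPriva's actual schema or evaluation procedure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PolicyItem:
    """One hypothetical benchmark-style item; every field name is an assumption."""
    context: str                 # source text the model is allowed to read
    policy: str                  # non-disclosure rule given alongside the context
    protected_terms: List[str]   # strings that would reveal the prohibited detail
    query: str                   # question posed to the model
    attack_type: str             # "direct" or "indirect"


def leaks(item: PolicyItem, answer: str) -> bool:
    """Naive check: True if any protected term appears verbatim in the answer."""
    lowered = answer.lower()
    return any(term.lower() in lowered for term in item.protected_terms)


def leakage_rate(items: List[PolicyItem], answers: List[str]) -> float:
    """Fraction of answers that leak a protected term (0.0 for empty input)."""
    if len(items) != len(answers):
        raise ValueError("items and answers must have the same length")
    if not items:
        return 0.0
    leaked = sum(leaks(item, answer) for item, answer in zip(items, answers))
    return leaked / len(items)
```

Verbatim substring matching would miss paraphrased leaks, which an indirect query may well elicit, so a realistic evaluation would likely need a stronger judge than this sketch.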
Taxonomy
Topics: Topic Modeling · Access Control and Trust · Natural Language Processing Techniques
