Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks
Anubhab Sahu, Diptisha Samanta, Reza Soosahabi

TL;DR
This paper presents an automated framework to evaluate and improve the security of LLM system instructions against encoding attacks, demonstrating that subtle instruction changes can significantly reduce leakage risks.
Contribution
It introduces a novel automated evaluation method for instruction security and proposes a mitigation strategy using instruction reshaping with Chain-of-Thought reasoning.
Findings
High attack success rates (> 0.7) in structured serialization formats.
One-shot instruction reshaping reduces attack success rate significantly.
Subtle wording and structural changes can enhance instruction confidentiality.
Abstract
System Instructions in Large Language Models (LLMs) are commonly used to enforce safety policies, define agent behavior, and protect sensitive operational context in agentic AI applications. These instructions may contain sensitive information such as API credentials, internal policies, and privileged workflow definitions, making system instruction leakage a critical security risk highlighted in the OWASP Top 10 for LLM Applications. Without incurring the overhead costs of reasoning models, many LLM applications rely on refusal-based instructions that block direct requests for system instructions, implicitly assuming that prohibited information can only be extracted through explicit queries. We introduce an automated evaluation framework that tests whether system instructions remain confidential when extraction requests are re-framed as encoding or structured output tasks. Across four…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
