A Closer Look at System Prompt Robustness
Norman Mu, Jonathan Lu, Michael Lavery, David Wagner

TL;DR
This paper investigates how robustly large language models adhere to their system prompts. It proposes new evaluation and fine-tuning datasets and evaluates how far fine-tuning and inference-time techniques improve that adherence.
Contribution
The paper introduces realistic evaluation and fine-tuning datasets and assesses various fine-tuning and inference methods for improving system prompt robustness in LLMs.
Findings
Fine-tuning with realistic data improves robustness.
Inference-time interventions like classifier-free guidance help.
Current techniques still fall short of full robustness.
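As a rough illustration of the inference-time finding, classifier-free guidance can be sketched as contrasting next-token scores computed with and without the system prompt, then extrapolating toward the system-prompt-conditioned scores. The sketch below is a minimal, model-free version under stated assumptions: the function names, the toy logits, and the guidance weight `gamma` are all hypothetical, and this is not presented as the authors' exact formulation.

```python
import math


def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cfg_scores(with_system, without_system, gamma=1.5):
    """Classifier-free guidance over next-token scores (illustrative).

    with_system:    logits from a forward pass that includes the system prompt
    without_system: logits from a forward pass with the system prompt removed
    gamma:          guidance weight (assumed name); gamma=1 recovers the
                    ordinary conditioned distribution, gamma>1 amplifies
                    whatever the system prompt changes.
    """
    cond = log_softmax(with_system)
    uncond = log_softmax(without_system)
    return [u + gamma * (c - u) for c, u in zip(cond, uncond)]


def greedy_token(scores):
    """Return the index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy three-token vocabulary: the system prompt favors token 0,
# while the unconditioned model favors token 1.
with_system = [2.0, 1.0, 0.0]
without_system = [1.0, 2.0, 0.0]

guided = cfg_scores(with_system, without_system, gamma=3.0)
choice = greedy_token(guided)
```

In a real decoder the two logit vectors would come from two forward passes per step, which roughly doubles inference cost; that trade-off is one reason such methods are usually treated as optional interventions rather than defaults.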
Abstract
System prompts have emerged as a critical control surface for specifying the behavior of LLMs in chat and agent settings. Developers depend on system prompts to specify important context, output format, personalities, guardrails, content policies, and safety countermeasures, all of which require models to robustly adhere to the system prompt, especially when facing conflicting or adversarial user inputs. In practice, models often forget to consider relevant guardrails or fail to resolve conflicting demands between the system and the user. In this work, we study various methods for improving system prompt robustness by creating realistic new evaluation and fine-tuning datasets based on prompts collected from OpenAI's GPT Store and HuggingFace's HuggingChat. Our experiments assessing models with a panel of new and existing benchmarks show that performance can be considerably improved…
Taxonomy
Topics: Fault Detection and Control Systems
