Steer Model beyond Assistant: Controlling System Prompt Strength via Contrastive Decoding
Yijiang River Dong, Tiancheng Hu, Zheng Hui, Nigel Collier

TL;DR
This paper presents system prompt strength, a training-free method that uses contrastive decoding to control how strongly a large language model follows its system prompt, improving adherence to the target system prompt across five benchmarks.
Contribution
The paper introduces a contrastive decoding technique that modulates the influence of the system prompt without retraining, giving practitioners finer control over model behavior.
Findings
Up to +8.5 strict accuracy on IFEval
+45pp refusal rate on OffTopicEval
+13% steerability on Prompt-Steering
Abstract
Large language models excel at complex instructions yet struggle to deviate from their helpful assistant persona, as post-training instills strong priors that resist conflicting instructions. We introduce system prompt strength, a training-free method that treats prompt adherence as a continuous control. By contrasting logits from target and default system prompts, we isolate and amplify the behavioral signal unique to the target persona by a scalar factor alpha. Across five diverse benchmarks spanning constraint satisfaction, behavioral control, pluralistic alignment, capability modulation, and stylistic control, our method yields substantial improvements: up to +8.5 strict accuracy on IFEval, +45pp refusal rate on OffTopicEval, and +13% steerability on Prompt-Steering. Our approach enables practitioners to modulate system prompt strength, providing dynamic control over model behavior…
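The logit contrast described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact combination rule, so the formula used here, `default + alpha * (target - default)`, is an assumption chosen so that `alpha = 0` recovers the default-prompt logits, `alpha = 1` recovers the target-prompt logits, and `alpha > 1` amplifies the difference. The function names `steer_logits` and `softmax` and the three-token toy vocabulary are hypothetical and are not taken from the paper.

```python
import math


def steer_logits(target_logits, default_logits, alpha):
    """Combine next-token logits from two forward passes.

    ASSUMPTION: the rule default + alpha * (target - default) is an
    illustrative guess at the contrast, not the paper's confirmed formula.
    """
    if len(target_logits) != len(default_logits):
        raise ValueError("logit vectors must cover the same vocabulary")
    return [d + alpha * (t - d) for t, d in zip(target_logits, default_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over a hypothetical three-token vocabulary.
target = [2.0, 1.0, 0.0]   # logits under the target system prompt
default = [1.0, 1.0, 1.0]  # logits under the default assistant prompt

for alpha in (0.0, 1.0, 2.0):
    probs = softmax(steer_logits(target, default, alpha))
    print(alpha, [round(p, 3) for p in probs])
```

In this toy setup, raising `alpha` shifts probability mass toward the token that the target prompt favors over the default prompt. In practice the two logit vectors would come from running the same model on the same user input under the two different system prompts at each decoding step.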
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning in Healthcare
