Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts
Kartik Sharma, Yiqiao Jin, Vineeth Rakesh, Yingtong Dou, Menghai Pan, Mahashweta Das, Srijan Kumar

TL;DR
Sysformer is a novel transformer-based approach that adaptively refines system prompts in instruction-tuned LLMs to improve safety and robustness without altering model parameters, effectively reducing harmful outputs and jailbreaking success.
Contribution
The paper introduces Sysformer, a transformer model that dynamically updates system prompts to enhance LLM safety, representing a new method that avoids costly fine-tuning.
Findings
Significantly increases refusal rate to harmful prompts by up to 80%
Improves compliance with safe prompts by up to 90%
Enhances robustness against jailbreaking attacks by up to 100%
Abstract
As large language models (LLMs) are deployed in safety-critical settings, it is essential to ensure that their responses comply with safety standards. Prior research has revealed that LLMs often fail to grasp the notion of safe behaviors, resulting in either unjustified refusals to harmless prompts or the generation of harmful content. While substantial efforts have been made to improve their robustness, existing defenses often rely on costly fine-tuning of model parameters or employ suboptimal heuristic techniques. In this work, we take a novel approach to safeguard LLMs by learning to adapt the system prompts in instruction-tuned LLMs. While LLMs are typically pre-trained to follow a fixed system prompt, we investigate the impact of tailoring the system prompt to each specific user input on the safety of the responses. To this end, we propose Sysformer, a…
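The idea of tailoring the system prompt to each user input can be illustrated with a small sketch. The following is not the authors' architecture: it assumes, purely for illustration, that the adapter is a single cross-attention step in which system-prompt token embeddings attend to user-prompt token embeddings, with identity projections in place of learned weights. The names `adapt_system_prompt` and `build_llm_input` are hypothetical, and the frozen LLM itself is left out.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]  # one row per token embedding


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def adapt_system_prompt(system_emb: Matrix, user_emb: Matrix) -> Matrix:
    """Hypothetical adapter: each system-prompt token attends to the user tokens.

    Assumption: one cross-attention step with identity projections, so this
    shows data flow only. The result has the same shape as system_emb.
    """
    dim = len(system_emb[0])
    scale = 1.0 / math.sqrt(dim)
    adapted = []
    for query in system_emb:
        weights = softmax([dot(query, key) * scale for key in user_emb])
        context = [
            sum(w * value[j] for w, value in zip(weights, user_emb))
            for j in range(dim)
        ]
        # Residual connection keeps the original system prompt as the base.
        adapted.append([q + c for q, c in zip(query, context)])
    return adapted


def build_llm_input(system_emb: Matrix, user_emb: Matrix) -> Matrix:
    """Prepend the adapted system prompt to the unchanged user embeddings.

    The combined sequence is what a frozen LLM would consume; only the
    adapter would be trained in this sketch.
    """
    return adapt_system_prompt(system_emb, user_emb) + user_emb
```

In this sketch the adapter runs once per query, and the user embeddings and LLM weights are never modified, which matches the abstract's framing of a defense that leaves model parameters untouched.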
Peer Reviews
Decision: ICLR 2026 Poster
The paper uses multiple benchmarks (JailbreakBench and StrongReject) plus 16 jailbreak variants. The proposed method shows strong empirical performance on these benchmarks, demonstrating that adaptive system prompts can meaningfully improve LLM safety and robustness without modifying model weights.
While the paper includes solid ablation studies on loss components and demonstrates impressive generalization to unseen jailbreak attack types, it does not assess cross-benchmark transfer — e.g., training Sysformer on JailbreakBench and evaluating on StrongReject (or vice versa). As a result, it remains unclear how well the learned safety behavior generalizes to qualitatively different harmful-prompt distributions. Including such a cross-dataset evaluation (or at least reporting zero-shot transf
+ The paper reads smoothly and clearly.
+ The baseline evaluation is rather comprehensive, covering efficient fine-tuning (LoRA) and embedding-space optimization. Dataset selection looks good.
- The transformer component takes in user prompts, which means the embedding prompt is generated on every query. While the motivation statement criticizes the efficiency of prior defense methods, Sysformer also introduces overhead, which is not evaluated.
- The training loss uses predefined fixed strings like "I cannot help you" as a signal of refusal, which restricts the flexibility of the training method. Not sure if the training pipeline is working on larger and more powerful models that do not answer
1. The proposed method is novel to my knowledge.
2. The defense effectiveness is good.
3. This paper is well written.
4. The defense method does not rely on finetuning the target LLM to be protected.
1. The baseline methods compared in this paper are very scarce. Many prompt-based defense methods, especially system-prompt-based ones, are not discussed or compared at all.
2. The proposed method relies on an additional dataset for training the prompt generation model. It is not clear how the proposed method depends on the size and quality of the training data. In addition, it is unclear whether the proposed method can work for new attacks that are not covered by the training data.
3. The propos
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Security and Verification in Computing
