Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts

Kartik Sharma; Yiqiao Jin; Vineeth Rakesh; Yingtong Dou; Menghai Pan; Mahashweta Das; Srijan Kumar

arXiv:2506.15751·cs.AI·March 9, 2026

Open Access · 3 Reviews

TL;DR

Sysformer is a novel transformer-based approach that adaptively refines system prompts in instruction-tuned LLMs to improve safety and robustness without altering model parameters, effectively reducing harmful outputs and jailbreaking success.

Contribution

The paper introduces Sysformer, a transformer model that dynamically updates system prompts to enhance LLM safety, representing a new method that avoids costly fine-tuning.
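As a rough illustration of what "dynamically updating the system prompt" could look like, the sketch below lets system-prompt token embeddings attend to user-prompt token embeddings through a single cross-attention step, leaving the downstream LLM untouched. This page does not describe the actual Sysformer architecture, so everything here is an assumption made for illustration: the function `adapt_system_prompt`, the single-head design, the residual connection, and the random weights are hypothetical and not the authors' implementation.

```python
import math
import random


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def adapt_system_prompt(sys_emb, user_emb, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention (an assumption, not the paper's code).

    System-prompt tokens act as queries; user-prompt tokens act as keys and values.
    Returns adapted system-prompt embeddings with the same shape as sys_emb.
    """
    dim = len(w_q[0])
    q = matmul(sys_emb, w_q)                # (n_sys, dim)
    k = matmul(user_emb, w_k)               # (n_user, dim)
    v = matmul(user_emb, w_v)               # (n_user, dim)
    k_t = [list(col) for col in zip(*k)]    # (dim, n_user)
    scores = matmul(q, k_t)                 # (n_sys, n_user)
    attn = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    update = matmul(attn, v)                # (n_sys, dim)
    # Residual connection: the original system prompt plus an input-dependent shift.
    return [[s + u for s, u in zip(s_row, u_row)]
            for s_row, u_row in zip(sys_emb, update)]


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights or token embeddings."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 4
sys_emb = random_matrix(3, dim, rng)   # 3 system-prompt tokens
user_a = random_matrix(5, dim, rng)    # one user prompt, 5 tokens
user_b = random_matrix(2, dim, rng)    # a different user prompt, 2 tokens
w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))

adapted_a = adapt_system_prompt(sys_emb, user_a, w_q, w_k, w_v)
adapted_b = adapt_system_prompt(sys_emb, user_b, w_q, w_k, w_v)
```

In this sketch only `w_q`, `w_k`, and `w_v` would be trained; the adapted embeddings would replace the fixed system-prompt embeddings at the input of the frozen LLM, so the same base system prompt yields a different result for each user prompt.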

Findings

01

Significantly increases refusal rate to harmful prompts by up to 80%

02

Improves compliance with safe prompts by up to 90%

03

Enhances robustness against jailbreaking attacks by up to 100%

Abstract

As large language models (LLMs) are deployed in safety-critical settings, it is essential to ensure that their responses comply with safety standards. Prior research has revealed that LLMs often fail to grasp the notion of safe behaviors, resulting in either unjustified refusals to harmless prompts or the generation of harmful content. While substantial efforts have been made to improve their robustness, existing defenses often rely on costly fine-tuning of model parameters or employ suboptimal heuristic techniques. In this work, we take a novel approach to safeguard LLMs by learning to adapt the system prompts in instruction-tuned LLMs. While LLMs are typically pre-trained to follow a fixed system prompt, we investigate the impact of tailoring the system prompt to each specific user input on the safety of the responses. To this end, we propose $Sysformer$, a…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The paper uses multiple benchmarks (JailbreakBench and StrongReject) plus 16 jailbreak variants. The proposed method shows strong empirical performance on these benchmarks, indicating that adaptive system prompts can meaningfully improve LLM safety and robustness without modifying model weights.

Weaknesses

While the paper includes solid ablation studies on loss components and demonstrates impressive generalization to unseen jailbreak attack types, it does not assess cross-benchmark transfer — e.g., training Sysformer on JailbreakBench and evaluating on StrongReject (or vice versa). As a result, it remains unclear how well the learned safety behavior generalizes to qualitatively different harmful-prompt distributions. Including such a cross-dataset evaluation (or at least reporting zero-shot transf

Reviewer 02 · Rating 4 · Confidence 3

Strengths

+ The paper reads smoothly and clearly.
+ The baseline evaluation is rather comprehensive, covering efficient fine-tuning (LoRA) and embedding-space optimization. Dataset selection looks good.

Weaknesses

- The transformer component takes in user prompts, which means the prompt embedding is regenerated on every query. While the motivation criticizes the efficiency of prior defense methods, Sysformer also introduces overhead, and this overhead is not evaluated.
- The training loss uses predefined fixed strings like "I cannot help you" as a signal of refusal, which restricts the flexibility of the training method. Not sure if the training pipeline is working on larger and more powerful models that do not answer

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The proposed method is novel to my knowledge.
2. The defense effectiveness is good.
3. This paper is well written.
4. The defense method does not rely on finetuning the target LLM to be protected.

Weaknesses

1. The baseline methods compared in this paper are scarce. Many prompt-based defenses, especially system-prompt-based ones, are not discussed or compared at all.
2. The proposed method relies on an additional dataset for training the prompt generation model. It is not clear how strongly the method depends on the size and quality of the training data. In addition, it is unclear whether the method works against new attacks that are not covered by the training data.
3. The propos

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Security and Verification in Computing