SLIP: Soft Label Mechanism and Key-Extraction-Guided CoT-based Defense Against Instruction Backdoor in APIs

Zhengxian Wu; Juan Wen; Wanli Peng; Haowei Chang; Yinghan Zhou; Yiming Xue

arXiv:2508.06153·cs.CR·April 17, 2026

TL;DR

SLIP is a defense against instruction backdoors in LLM APIs. It combines a soft label mechanism with key-extraction-guided Chain-of-Thought (CoT) reasoning, aiming to lower the attack success rate while preserving accuracy on clean inputs.

Contribution

SLIP is the first to integrate soft label mechanisms with key-extraction-guided Chain-of-Thought reasoning for backdoor defense in LLMs.

Findings

01 Reduces attack success rate to 25.13%

02 Improves clean accuracy to 87.15%

03 Outperforms existing black-box defenses

Abstract

Customized Large Language Model (LLM) agents face a critical security threat from black-box instruction backdoors, where malicious behaviors are covertly injected through hidden system instructions. Although existing prompt-based defenses can often detect poisoned inputs, they generally fail to recover correct outputs once the backdoor is activated. In this paper, we first conduct a mechanistic analysis of LLM behavior under instruction backdoors and reveal two pivotal phenomena: (1) cognitive override, in which backdoor triggers dominate the reasoning process and suppress task-relevant context, and (2) abnormal semantic correlation, where triggers establish excessively strong semantic associations with attacker-specified target labels. Based on these insights, we propose a Soft Label mechanism and key-extraction-guided CoT-based defense against…
