TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models

Zhen Guo; Shanghao Shi; Hao Li; Shamim Yazdani; Ning Zhang; Reza Tourani

arXiv:2603.02436·cs.CR·March 4, 2026

Open Access

TL;DR

TraceGuard is a process-guided security framework that turns compact, deployable models into reasoning firewalls able to detect and defend against reasoning backdoors in Large Reasoning Models used in high-stakes decision-making pipelines.

Contribution

It introduces a multi-phase approach that combines forensic synthesis, structural fine-tuning, and reinforcement learning so that small models can robustly verify reasoning traces.

Findings

01. Achieves forensic precision comparable to much larger models.

02. Effectively detects latent backdoors and post-hoc rationalizations.

03. Maintains robustness against adaptive adversaries in grey-box settings.

Abstract

The deployment of Large Reasoning Models (LRMs) in high-stakes decision-making pipelines has introduced a novel and opaque attack surface: reasoning backdoors. In these attacks, the model's intermediate Chain-of-Thought (CoT) is manipulated to provide a linguistically plausible but logically fallacious justification for a malicious conclusion. While frontier models exhibit an intrinsic capacity to detect these fractures, compact, deployable models suffer from a fundamental verification gap, relying on fragile lexical heuristics that are easily bypassed by motivated adversaries. To bridge this gap, we propose TraceGuard, a process-guided security framework that transforms small-scale models into robust reasoning firewalls. Our approach treats the reasoning trace as an untrusted payload and establishes a defense-in-depth strategy through three synergistic phases: (1) Automated Forensic…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Security and Verification in Computing