TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models
Zhen Guo, Shanghao Shi, Hao Li, Shamim Yazdani, Ning Zhang, Reza Tourani

TL;DR
TraceGuard is a process-guided framework that turns small, deployable models into firewalls that detect and block reasoning backdoors in large language models used in high-stakes decision-making pipelines.
Contribution
It introduces a multi-phase approach that combines forensic synthesis, structural fine-tuning, and reinforcement learning so that small models can robustly verify reasoning traces.
Findings
Achieves forensic precision comparable to much larger models.
Effectively detects latent backdoors and post-hoc rationalizations.
Maintains robustness against adaptive adversaries in grey-box settings.
Abstract
The deployment of Large Reasoning Models (LRMs) in high-stakes decision-making pipelines has introduced a novel and opaque attack surface: reasoning backdoors. In these attacks, the model's intermediate Chain-of-Thought (CoT) is manipulated to provide a linguistically plausible but logically fallacious justification for a malicious conclusion. While frontier models exhibit an intrinsic capacity to detect these fractures, compact, deployable models suffer from a fundamental verification gap, relying on fragile lexical heuristics that are easily bypassed by motivated adversaries. To bridge this gap, we propose TraceGuard, a process-guided security framework that transforms small-scale models into robust reasoning firewalls. Our approach treats the reasoning trace as an untrusted payload and establishes a defense-in-depth strategy through three synergistic phases: (1) Automated Forensic…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Security and Verification in Computing
