TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention

Chuancheng Shi; Shangze Li; Wenjun Lu; Wenhua Wu; Cong Wang; Zifeng Cheng; Fei Shen; Tat-Seng Chua

arXiv:2601.21900 · cs.CV · February 3, 2026

Open Access

TL;DR

TraceRouter is a path-level intervention framework that enhances the robustness of large foundation models against adversarial attacks by tracing and disconnecting harmful semantic pathways, outperforming existing defenses.

Contribution

The paper introduces a path-level approach that identifies and severs the causal circuits of malicious semantics, improving robustness without sacrificing utility.

Findings

1. Significantly improves adversarial robustness over baselines

2. Effectively isolates malicious features using sparse autoencoders

3. Maintains high utility while defending against adversarial manipulation

Abstract

Despite their capabilities, large foundation models (LFMs) remain susceptible to adversarial manipulation. Current defenses predominantly rely on the "locality hypothesis", suppressing isolated neurons or features. However, harmful semantics act as distributed, cross-layer circuits, rendering such localized interventions brittle and detrimental to utility. To bridge this gap, we propose TraceRouter, a path-level framework that traces and disconnects the causal propagation circuits of illicit semantics. TraceRouter operates in three stages: (1) it pinpoints a sensitive onset layer by analyzing attention divergence; (2) it leverages sparse autoencoders (SAEs) and differential activation analysis to disentangle and isolate malicious features; and (3) it maps these features to downstream causal pathways via feature influence scores (FIS) derived from zero-out interventions. By…
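
Stages (2) and (3) of the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the identity "SAE encoder" `W_enc`, the downstream map `W_down`, the example activations, and the particular way the differential activation and the zero-out influence are computed. The abstract does not give the paper's exact FIS formula, so this should not be read as the authors' implementation.

```python
# Toy illustration only. W_enc, W_down, and the data are hypothetical;
# the scoring rules are assumed stand-ins, not the paper's definitions.

def relu(x):
    return [max(0.0, v) for v in x]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def encode(W_enc, h):
    """Hypothetical SAE encoder: sparse features = ReLU(W_enc @ h)."""
    return relu(matvec(W_enc, h))

def mean_features(W_enc, batch):
    """Average SAE feature activation over a batch of hidden states."""
    feats = [encode(W_enc, h) for h in batch]
    return [sum(f[i] for f in feats) / len(feats) for i in range(len(feats[0]))]

def differential_activation(W_enc, harmful, benign):
    """Mean feature activation on harmful inputs minus mean on benign inputs."""
    mean_h = mean_features(W_enc, harmful)
    mean_b = mean_features(W_enc, benign)
    return [a - b for a, b in zip(mean_h, mean_b)]

def zero_out_influence(W_down, feats, idx):
    """Absolute change in each downstream unit when feature idx is zeroed."""
    base = relu(matvec(W_down, feats))
    ablated = list(feats)
    ablated[idx] = 0.0
    out = relu(matvec(W_down, ablated))
    return [abs(a - b) for a, b in zip(base, out)]

# Hypothetical hidden states at the onset layer (3 dimensions).
W_enc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
harmful = [[1.0, 0.0, 2.0], [1.0, 0.0, 4.0]]
benign = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]

# Stage 2 (sketch): pick the feature that fires most on harmful vs. benign inputs.
diff = differential_activation(W_enc, harmful, benign)
suspect = max(range(len(diff)), key=lambda i: diff[i])

# Stage 3 (sketch): zero that feature and see which downstream units change.
W_down = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0]]
influence = zero_out_influence(W_down, encode(W_enc, harmful[0]), suspect)

print(diff, suspect, influence)
```

In this toy setup the third feature has the largest harmful-minus-benign gap, and zeroing it changes only the first downstream unit, so that unit would be the one placed on the traced pathway.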

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)