TrajShield: Trajectory-Level Safety Mediation for Defending Text-to-Video Models Against Jailbreak Attacks

Quanchen Zou; Nizhang Li; Wenxin Zhang; Jiaye Lin; Yangchen Zeng; Xiangzheng Zhang; Zonghao Ying

arXiv:2605.01761 · cs.CV · May 5, 2026

TL;DR

TrajShield is a training-free, inference-time framework that enhances the safety of text-to-video models by detecting and neutralizing unsafe content through causal intervention on temporal trajectories.

Contribution

TrajShield introduces a causal intervention approach to T2V safety that handles explicit, jailbreak, and temporally emergent risks without retraining the underlying model.

Findings

01. Achieves a 52.44% average reduction in attack success rate (ASR) on T2VSafetyBench.

02. Outperforms existing defenses across 14 safety categories.

03. Maintains high semantic fidelity while improving safety.

Abstract

Text-to-Video (T2V) models have demonstrated remarkable capability in generating temporally coherent videos from natural language prompts, yet they also risk producing unsafe content such as violence or explicit material. Existing prompt-level defenses are largely inherited from text-to-image safety and operate on the lexical surface of the input, making them vulnerable to jailbreak attacks that disguise harmful intent through rephrasing or adversarial prompting. Moreover, T2V generation introduces a distinctive challenge overlooked by prior work: temporally emergent risk, where a seemingly benign prompt leads to unsafe content through the generator's temporal extrapolation toward narrative coherence. We propose TrajShield, a training-free, inference-time defense framework that reformulates T2V safety as a causal intervention in a temporally structured semantic space. TrajShield handles…
