TRAP: Tail-aware Ranking Attack for World-Model Planning
Siyuan Duan, Ke Zhang, Xizhao Luo

TL;DR
TRAP is a backdoor attack on the trajectory-ranking step of world-model planning: when triggered, it hijacks planning and degrades performance, while behavior on clean inputs stays normal.
Contribution
The paper introduces TRAP, a tail-aware ranking attack framework that exploits the long-tailed ranking structure of imagined trajectories in world models as a backdoor vulnerability.
Findings
TRAP effectively hijacks planning by altering trajectory rankings.
Experiments show TRAP causes significant performance degradation.
TRAP preserves normal trajectory rankings on clean inputs, which keeps the backdoor stealthy.
Abstract
World models enable long-horizon planning by internally generating and evaluating imagined trajectories, making them a promising foundation for generalist agents. However, this imagination-driven decision process also introduces new security risks. Existing backdoor attacks typically aim to manipulate local features, one-step predictions, or instantaneous policy outputs. While such objectives may suffice for weaker reactive models, they are often ineffective against world models, where the learned dynamics prior and planning process can absorb or wash out the effects of shallow perturbations. More importantly, we find that world models exhibit a distinct backdoor vulnerability rooted in the long-tailed ranking structure of imagined trajectories, where disrupting the ordering of a few decision-critical trajectories can systematically hijack planning. To exploit this vulnerability, we…
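The claim that reordering a few decision-critical trajectories can hijack planning can be made concrete with a toy example. The sketch below is an illustration only, not TRAP's objective or training procedure (which this page does not describe). It assumes a hypothetical CEM-style planner that averages the first actions of its top-k imagined trajectories, heavy-tailed predicted returns, and a hypothetical perturbation `swap_tail_ranks` that exchanges the scores of the k elite trajectories with k non-elite ones whose first action is most negative. All function names and parameters are illustrative.

```python
import numpy as np


def plan_action(pred_returns: np.ndarray, first_actions: np.ndarray, top_k: int = 5) -> float:
    """Toy CEM-style planner (assumption): average the first action of the
    top_k imagined trajectories ranked by predicted return."""
    elites = np.argsort(pred_returns)[::-1][:top_k]
    return float(first_actions[elites].mean())


def pairwise_rank_agreement(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of trajectory pairs (i < j) whose relative order is the same
    under score vectors a and b (a Kendall-tau-style agreement measure)."""
    sign_a = np.sign(a[:, None] - a[None, :])
    sign_b = np.sign(b[:, None] - b[None, :])
    upper = np.triu_indices(len(a), k=1)
    return float(np.mean(sign_a[upper] == sign_b[upper]))


def swap_tail_ranks(pred_returns: np.ndarray, first_actions: np.ndarray, top_k: int = 5) -> np.ndarray:
    """Hypothetical tail perturbation: give the top_k return values to the
    top_k non-elite trajectories with the most negative first action, and
    vice versa. Every other trajectory keeps its predicted return."""
    elites = np.argsort(pred_returns)[::-1][:top_k]
    candidates = np.setdiff1d(np.arange(len(pred_returns)), elites)
    targets = candidates[np.argsort(first_actions[candidates])[:top_k]]
    perturbed = pred_returns.copy()
    perturbed[elites], perturbed[targets] = pred_returns[targets], pred_returns[elites]
    return perturbed


rng = np.random.default_rng(0)
n = 1000
# Heavy-tailed predicted returns: a few imagined trajectories score far above the rest.
pred_returns = rng.pareto(2.0, n)
# First actions loosely correlated with return, so the clean plan favors positive actions.
first_actions = np.tanh(pred_returns - np.median(pred_returns)) + rng.normal(0.0, 0.1, n)

clean_action = plan_action(pred_returns, first_actions)
perturbed = swap_tail_ranks(pred_returns, first_actions)
hijacked_action = plan_action(perturbed, first_actions)
agreement = pairwise_rank_agreement(pred_returns, perturbed)
print(f"clean action {clean_action:+.2f}, hijacked action {hijacked_action:+.2f}, "
      f"pairwise rank agreement {agreement:.3f}")
```

Only 10 of 1,000 scores change, so at least about 98% of pairwise orderings are untouched (pairs that involve no swapped trajectory cannot flip), yet the planned action changes sign. This is the intuition behind attacking the tail of the ranking rather than the ranking as a whole.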
