Safety, Security, and Cognitive Risks in World Models
Manoj Parmar

TL;DR
This paper surveys risks associated with world models in AI, introduces formal risk definitions, presents an attacker taxonomy, and demonstrates adversarial attacks, emphasizing the need for rigorous safety measures.
Contribution
The paper provides a comprehensive threat model for world models, formal definitions of risks, and empirical validation of adversarial attacks, advancing safety research in AI systems.
Findings
Adversarial attacks can significantly degrade world model performance.
Architecture influences vulnerability to trajectory-persistent adversarial attacks.
Real world-model implementations such as DreamerV3 exhibit non-zero action drift under attack.
Abstract
World models (learned internal simulators of environment dynamics) are rapidly becoming foundational to autonomous decision-making in robotics, autonomous vehicles, and agentic AI. By predicting future states in compressed latent spaces, they enable sample-efficient planning and long-horizon imagination without direct environment interaction. Yet this predictive power introduces a distinctive set of safety, security, and cognitive risks. Adversaries can corrupt training data, poison latent representations, and exploit compounding rollout errors to cause significant degradation in safety-critical deployments. At the alignment layer, agents equipped with world models are more capable of goal misgeneralisation, deceptive alignment, and reward hacking. At the human layer, authoritative-seeming world model predictions foster automation bias, miscalibrated trust, and planning hallucination. This paper…
