PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth

Bu Jin; Weize Li; Baihan Yang; Zhenxin Zhu; Junpeng Jiang; Huan-ang Gao; Haiyang Sun; Kun Zhan; Hengtong Hu; Xueyang Zhang; Peng Jia; Hao Zhao

arXiv:2505.01729 · cs.CV · July 21, 2025



TL;DR

PosePilot is a novel framework that improves camera pose control in generative world models for autonomous driving by leveraging self-supervised depth estimation and structure-from-motion principles, resulting in more accurate and consistent viewpoint synthesis.

Contribution

The paper introduces PosePilot, a lightweight method that enhances camera pose controllability in generative world models using self-supervised depth and pose estimation techniques.

Findings

01. Significantly improves structural understanding and motion reasoning.

02. Enhances viewpoint accuracy and geometric consistency.

03. Sets new benchmarks for pose controllability in generative models.

Abstract

Recent advancements in autonomous driving (AD) systems have highlighted the potential of world models in achieving robust and generalizable performance across both ordinary and challenging driving conditions. However, a key challenge remains: precise and flexible camera pose control, which is crucial for accurate viewpoint transformation and realistic simulation of scene dynamics. In this paper, we introduce PosePilot, a lightweight yet powerful framework that significantly enhances camera pose controllability in generative world models. Drawing inspiration from self-supervised depth estimation, PosePilot leverages structure-from-motion principles to establish a tight coupling between camera pose and video generation. Specifically, we incorporate self-supervised depth and pose readouts, allowing the model to infer depth and relative camera motion directly from video sequences. These…
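
The structure-from-motion coupling mentioned in the abstract can be illustrated with the standard view-synthesis relation used in self-supervised depth estimation: a pixel with known depth is back-projected to 3D, moved by the relative camera pose, and projected into the neighbouring frame. The sketch below is a minimal illustration of that general relation under a pinhole camera assumption. It is not taken from the paper; the function name `reproject_pixel` and all parameter names are hypothetical.

```python
from typing import Sequence, Tuple


def reproject_pixel(
    u: float,
    v: float,
    depth: float,
    intrinsics: Tuple[float, float, float, float],
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
) -> Tuple[float, float]:
    """Map a source pixel into a target view given its depth and relative pose.

    Assumes a pinhole camera with intrinsics (fx, fy, cx, cy) and a rigid
    transform (rotation, translation) from the source to the target camera
    frame. All names here are illustrative, not the paper's API.
    """
    fx, fy, cx, cy = intrinsics

    # Back-project the pixel to a 3D point in the source camera frame.
    point = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]

    # Apply the relative camera motion: q = R @ point + t.
    moved = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if moved[2] <= 0:
        raise ValueError("point falls behind the target camera")

    # Project the moved point back onto the target image plane.
    return (fx * moved[0] / moved[2] + cx, fy * moved[1] / moved[2] + cy)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = (500.0, 500.0, 320.0, 240.0)  # hypothetical intrinsics

# With no camera motion, the pixel maps to itself.
same_u, same_v = reproject_pixel(100.0, 50.0, 10.0, K, IDENTITY, [0.0, 0.0, 0.0])

# A pure sideways offset shifts the pixel by fx * tx / depth.
shift_u, shift_v = reproject_pixel(320.0, 240.0, 10.0, K, IDENTITY, [1.0, 0.0, 0.0])
```

In self-supervised pipelines of this kind, the target frame is typically sampled at the reprojected coordinates and compared photometrically with the source frame, so errors in either depth or pose show up as a reconstruction error. Whether PosePilot uses exactly this formulation is not stated in the truncated abstract.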


Taxonomy

Topics: Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning