ViPE: Video Pose Engine for 3D Geometric Perception

Jiahui Huang; Qunjie Zhou; Hesam Rabeti; Aleksandr Korovko; Huan Ling; Xuanchi Ren; Tianchang Shen; Jun Gao; Dmitry Slepichev; Chen-Hsuan Lin; Jiawei Ren; Kevin Xie; Joydeep Biswas; Laura Leal-Taixe; Sanja Fidler

arXiv:2508.10934·cs.CV·August 18, 2025

TL;DR

ViPE is a versatile video processing engine that accurately estimates camera parameters and dense depth maps from unconstrained videos, enabling large-scale annotation for spatial AI applications.

Contribution

Introduces ViPE, a robust, efficient tool for estimating camera pose and depth from diverse videos, and provides a large annotated dataset to advance spatial AI research.

Findings

1. Outperforms existing uncalibrated pose estimation baselines by 18% on TUM and 50% on KITTI sequences.

2. Runs at 3-5 FPS on a single GPU.

3. Annotated approximately 96 million frames across various video types.
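
To give a sense of scale for the second and third findings, here is a back-of-envelope sketch in Python. It assumes, purely for illustration, that all of the roughly 96 million frames were processed at the quoted 3-5 FPS single-GPU rate; the page does not say how the annotation run was actually scheduled, and `gpu_hours` is a hypothetical helper name.

```python
def gpu_hours(num_frames: int, fps: float) -> float:
    """Single-GPU hours needed to process num_frames at a steady fps."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    seconds = num_frames / fps
    return seconds / 3600.0


# Figures taken from the findings list; the steady-rate assumption is ours.
FRAMES = 96_000_000
slow_estimate = gpu_hours(FRAMES, 3.0)  # lower end of the quoted rate
fast_estimate = gpu_hours(FRAMES, 5.0)  # upper end of the quoted rate

print(f"{fast_estimate:,.0f} to {slow_estimate:,.0f} single-GPU hours")
```

Under that assumption the total lands at roughly 5,300 to 8,900 single-GPU hours, which is why throughput matters for annotation at this scale.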

Abstract

Accurate 3D geometric perception is an important prerequisite for a wide range of spatial AI systems. While state-of-the-art methods depend on large-scale training data, acquiring consistent and precise 3D annotations from in-the-wild videos remains a key challenge. In this work, we introduce ViPE, a handy and versatile video processing engine designed to bridge this gap. ViPE efficiently estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos. It is robust to diverse scenarios, including dynamic selfie videos, cinematic shots, or dashcams, and supports various camera models such as pinhole, wide-angle, and 360° panoramas. We have benchmarked ViPE on multiple benchmarks. Notably, it outperforms existing uncalibrated pose estimation baselines by 18%/50% on TUM/KITTI sequences, and runs at 3-5 FPS on a single GPU for standard input…
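
The abstract mentions several camera models and per-pixel depth. As a rough illustration of what two of those models mean geometrically, the sketch below projects a 3D point with a pinhole model and with a 360° equirectangular model, and lifts a pixel with known depth back to 3D. This is not ViPE's code or API: the function names, the axis convention (x right, y down, z forward), and the example intrinsics are all assumptions made for this example, and wide-angle distortion models are left out.

```python
import math


def project_pinhole(point, fx, fy, cx, cy):
    """Project a camera-frame point (x, y, z) to pixel coordinates (u, v).

    Assumes an ideal pinhole camera with no lens distortion.
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    return (fx * x / z + cx, fy * y / z + cy)


def unproject_pinhole(pixel, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with depth along the z axis back to a 3D point."""
    u, v = pixel
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def project_equirect(point, width, height):
    """Project a camera-frame point onto an equirectangular panorama.

    Longitude is measured from the +z axis toward +x, latitude toward +y
    (down), so the forward direction maps to the image centre.
    """
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0:
        raise ValueError("cannot project the camera centre")
    lon = math.atan2(x, z)   # in [-pi, pi]
    lat = math.asin(y / r)   # in [-pi/2, pi/2]
    u = (lon / (2 * math.pi) + 0.5) * width
    v = (lat / math.pi + 0.5) * height
    return (u, v)


# Hypothetical intrinsics for a 640x480 image.
fx = fy = 500.0
cx, cy = 320.0, 240.0

pixel = project_pinhole((1.0, 0.5, 2.0), fx, fy, cx, cy)
point = unproject_pinhole(pixel, 2.0, fx, fy, cx, cy)
print(pixel, point)
print(project_equirect((0.0, 0.0, 1.0), 2048, 1024))
```

The round trip through `project_pinhole` and `unproject_pinhole` shows why intrinsics and a dense depth map together are enough to turn each frame into a 3D point cloud in the camera frame; camera motion then places those points in a common world frame.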
