LA-Pose: Latent Action Pretraining Meets Pose Estimation
Zhengqing Wang, Saurabh Nair, Prajwal Chidananda, Pujith Kachana, Samuel Li, Matthew Brown, Yasutaka Furukawa

TL;DR
LA-Pose introduces a self-supervised pretraining approach using inverse-dynamics models to improve camera pose estimation, achieving high accuracy with less labeled data.
Contribution
It repurposes latent action features from inverse-dynamics pretraining as inputs for pose estimation, reducing reliance on extensive 3D annotations.
Findings
LA-Pose outperforms recent feed-forward methods by over 10% on Waymo and PandaSet.
It achieves competitive or superior accuracy with significantly less labeled data.
It is the first to demonstrate that inverse-dynamics self-supervised learning is effective for pose estimation.
Abstract
This paper revisits camera pose estimation through the lens of self-supervised pretraining, focusing on inverse-dynamics pretraining as a scalable alternative to the current trend of fully supervised training with 3D annotations. Concretely, we employ inverse- and forward-dynamics models to learn latent action representations from large-scale driving videos, similar to Genie. Our idea is simple yet effective. Existing methods use latent actions in their original capacity, that is, as action conditioning for world models or as proxies for robot action parameters in policy networks. Our method, dubbed LA-Pose, instead repurposes the latent action features as inputs to a camera pose estimator, which is finetuned on a limited set of high-quality 3D annotations. This formulation enables accurate and generalizable pose prediction while maintaining feed-forward efficiency. Extensive experiments on driving…
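The two-stage idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the random linear maps stand in for trained networks, and every name and dimension here (`FRAME_DIM`, `LATENT_DIM`, `POSE_DIM`, `latent_action`, `pretrain_loss`, `predict_relative_pose`) is a hypothetical choice made for illustration. It only shows the data flow: an inverse-dynamics model maps a frame pair to a latent action, a forward-dynamics model uses that latent to predict the next frame (the self-supervised objective), and a pose head reads the same latent to regress relative camera pose.

```python
import random

# Hypothetical sizes: a frame is a flat feature vector; a pose is
# 3 rotation + 3 translation parameters. None of these come from the paper.
FRAME_DIM, LATENT_DIM, POSE_DIM = 8, 4, 6

def linear(in_dim, out_dim, rng):
    """Random (out_dim x in_dim) weight matrix standing in for a trained network."""
    return [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def matvec(weights, x):
    """Multiply a weight matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

rng = random.Random(0)
# Inverse dynamics: (frame_t, frame_t1) -> latent action.
inverse_dynamics = linear(2 * FRAME_DIM, LATENT_DIM, rng)
# Forward dynamics: (frame_t, latent action) -> predicted frame_t1.
forward_dynamics = linear(FRAME_DIM + LATENT_DIM, FRAME_DIM, rng)
# Pose head: latent action -> relative camera pose (would be finetuned on labels).
pose_head = linear(LATENT_DIM, POSE_DIM, rng)

def latent_action(frame_t, frame_t1):
    """Infer the latent action that explains the change between two frames."""
    return matvec(inverse_dynamics, frame_t + frame_t1)

def pretrain_loss(frame_t, frame_t1):
    """Self-supervised objective: mean squared error of the next-frame prediction."""
    z = latent_action(frame_t, frame_t1)
    predicted = matvec(forward_dynamics, frame_t + z)
    return sum((p - t) ** 2 for p, t in zip(predicted, frame_t1)) / len(frame_t1)

def predict_relative_pose(frame_t, frame_t1):
    """Repurpose the latent action as the input to a pose regressor."""
    return matvec(pose_head, latent_action(frame_t, frame_t1))

frame_t = [rng.random() for _ in range(FRAME_DIM)]
frame_t1 = [rng.random() for _ in range(FRAME_DIM)]
loss = pretrain_loss(frame_t, frame_t1)
pose = predict_relative_pose(frame_t, frame_t1)
print(f"pretraining loss: {loss:.4f}, pose vector length: {len(pose)}")
```

In this sketch the pretraining loss needs no pose labels, which is the property the abstract relies on; labels would only be required when fitting `pose_head`.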
