Can Generative Video Models Help Pose Estimation?
Ruojin Cai, Jason Y. Zhang, Philipp Henzler, Zhengqi Li, Noah Snavely, Ricardo Martin-Brualla

TL;DR
This paper introduces InterPose, a novel method that uses pre-trained generative video models to hallucinate intermediate frames, improving pose estimation in challenging scenarios with little or no overlap between images.
Contribution
The paper proposes a new approach leveraging generative video models and a self-consistency score to enhance pose estimation across diverse scenes, outperforming existing methods.
Findings
InterPose improves pose estimation accuracy on multiple datasets.
Relying on pre-trained generative video models lets the method generalize well across different scenes.
The self-consistency score effectively filters implausible pose predictions.
Abstract
Pairwise pose estimation from images with little or no overlap is an open challenge in computer vision. Existing methods, even those trained on large-scale datasets, struggle in these scenarios due to the lack of identifiable correspondences or visual overlap. Inspired by the human ability to infer spatial relationships from diverse scenes, we propose a novel approach, InterPose, that leverages the rich priors encoded within pre-trained generative video models. We propose to use a video model to hallucinate intermediate frames between two input images, effectively creating a dense, visual transition, which significantly simplifies the problem of pose estimation. Since current video models can still produce implausible motion or inconsistent geometry, we introduce a self-consistency score that evaluates the consistency of pose predictions from sampled videos. We demonstrate that our…
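The self-consistency idea in the abstract can be illustrated with a small sketch. It assumes that a pose estimator has already produced one relative rotation per sampled video, and it scores each candidate by its mean geodesic distance to the others, keeping the one that agrees best with the rest (the medoid). This selection rule and the helper names `rotation_angle_deg`, `select_most_consistent`, and `rot_z` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np


def rotation_angle_deg(R_a, R_b):
    """Geodesic distance between two 3x3 rotation matrices, in degrees."""
    cos_theta = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))


def select_most_consistent(rotations):
    """Pick the candidate with the lowest mean distance to all others.

    Hypothetical scoring rule: a low score means the candidate agrees
    with most other samples; outliers receive a high score.
    Returns (index of the selected candidate, its score in degrees).
    """
    n = len(rotations)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = rotation_angle_deg(rotations[i], rotations[j])
            dist[i, j] = dist[j, i] = d
    scores = dist.sum(axis=1) / max(n - 1, 1)
    best = int(np.argmin(scores))
    return best, float(scores[best])


def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees (toy data helper)."""
    t = np.radians(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Toy stand-ins for poses estimated from five sampled videos:
# four roughly agree, one (120 degrees) mimics an implausible video.
candidates = [rot_z(30), rot_z(32), rot_z(31), rot_z(29), rot_z(120)]
best_idx, best_score = select_most_consistent(candidates)
print(best_idx, round(best_score, 2))
```

In this toy example the 120-degree candidate is far from every other sample, so it gets the worst score and is never selected. A fuller version would also compare translation directions, which this sketch omits.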
Taxonomy
Topics: Advanced Vision and Imaging · Human Motion and Animation · Human Pose and Action Recognition
