TL;DR
VGGT-HPE is a relative head pose estimator: it predicts the rigid transformation between two observed head configurations rather than an absolute pose. Trained solely on synthetic data, it reports state-of-the-art results on the BIWI benchmark without any real-world training data.
Contribution
The paper proposes a relative head pose estimation approach, finetuned only on synthetic facial renderings, that outperforms traditional absolute regression and supports the argument that relative prediction is the easier and more robust formulation.
Findings
Achieves state-of-the-art results on the BIWI benchmark.
Relative prediction outperforms absolute regression, especially on difficult poses.
Training on synthetic data alone, with no real-world images, suffices for high accuracy.
Abstract
Monocular head pose estimation is traditionally formulated as direct regression from a single image to an absolute pose. This paradigm forces the network to implicitly internalize a dataset-specific canonical reference frame. In this work, we argue that predicting the relative rigid transformation between two observed head configurations is a fundamentally easier and more robust formulation. We introduce VGGT-HPE, a relative head pose estimator built upon a general-purpose geometry foundation model. Finetuned exclusively on synthetic facial renderings, our method sidesteps the need for an implicit anchor by reducing the problem to estimating a geometric displacement from an explicitly provided anchor with a known pose. As a practical benefit, the relative formulation also allows the anchor to be chosen at test time - for instance, a near-neutral frame or a temporally adjacent one - so…
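The relative formulation described in the abstract can be illustrated with a small NumPy sketch. It is not the paper's code: the helper names (`make_pose`, `relative_pose`, `apply_relative`, `geodesic_deg`) are hypothetical, and the left-multiplication convention `T_target = T_rel @ T_anchor` is an assumption chosen for illustration. The sketch shows how an absolute pose would be recovered from an anchor with a known pose plus a relative rigid transformation, which in practice would be the network's prediction.

```python
import numpy as np

def rot_y(deg):
    """Rotation matrix for a yaw of `deg` degrees about the y-axis."""
    a = np.deg2rad(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def make_pose(R, t):
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 rigid transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def relative_pose(T_anchor, T_target):
    """Relative transform under the assumed convention T_target = T_rel @ T_anchor."""
    return T_target @ np.linalg.inv(T_anchor)

def apply_relative(T_anchor, T_rel):
    """Recover the absolute target pose from a known anchor pose and a relative transform."""
    return T_rel @ T_anchor

def geodesic_deg(Ra, Rb):
    """Angle in degrees of the rotation that takes Ra to Rb."""
    cos = (np.trace(Ra.T @ Rb) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Anchor frame with a known pose, for example a near-neutral frame.
T_anchor = make_pose(rot_y(10.0), [0.0, 0.0, 1.0])
# Ground-truth target pose, used here only to build a stand-in for the prediction.
T_target = make_pose(rot_y(40.0), [0.1, 0.0, 1.0])

T_rel = relative_pose(T_anchor, T_target)      # stand-in for the predicted displacement
T_recovered = apply_relative(T_anchor, T_rel)  # absolute pose of the target frame
rel_angle = geodesic_deg(np.eye(3), T_rel[:3, :3])
```

Because the anchor pose enters only through `apply_relative`, the same relative estimate can be composed with whichever anchor is chosen at test time, such as a near-neutral or temporally adjacent frame.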
