TL;DR
DenseMarks introduces a learned 3D canonical embedding for human head images, enabling robust dense correspondence, head tracking, and stereo reconstruction across diverse poses and individuals.
Contribution
It presents a novel Vision Transformer-based approach to learn a canonical 3D embedding for human heads using pairwise point matches and multi-task constraints.
Findings
Achieves state-of-the-art results in geometry-aware point matching.
Demonstrates robust monocular head tracking across pose variations.
Provides a canonical space that is consistent across individuals.
Abstract
We propose DenseMarks, a new learned representation for human heads that enables high-quality dense correspondences of human head images. For a 2D image of a human head, a Vision Transformer network predicts a 3D embedding for each pixel, which corresponds to a location in a 3D canonical unit cube. To train our network, we collect a dataset of pairwise point matches, estimated by a state-of-the-art point tracker over a collection of diverse in-the-wild talking-head videos, and guide the mapping via a contrastive loss, encouraging matched points to have close embeddings. We further employ multi-task learning with face landmarks and segmentation constraints, and impose spatial continuity of embeddings through latent cube features, which results in an interpretable and queryable canonical space. The representation can be used for finding common semantic parts, face/head…
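The contrastive objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact loss, so this uses a generic InfoNCE-style formulation as a stand-in. The sigmoid mapping into the unit cube, the squared-distance similarity, the `temperature` value, and the function names `to_unit_cube` and `matched_point_contrastive_loss` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def to_unit_cube(raw):
    """Squash raw per-pixel 3-vectors into (0, 1)^3 with a sigmoid (an assumption)."""
    return 1.0 / (1.0 + np.exp(-raw))


def matched_point_contrastive_loss(emb_a, emb_b, temperature=0.1):
    """InfoNCE-style loss over N matched points (hypothetical stand-in).

    emb_a, emb_b: (N, 3) arrays of cube coordinates; row i of emb_a is
    matched to row i of emb_b, and every other row acts as a negative.
    """
    # Pairwise squared Euclidean distances between the two point sets.
    diff = emb_a[:, None, :] - emb_b[None, :, :]        # (N, N, 3)
    logits = -np.sum(diff ** 2, axis=-1) / temperature  # (N, N)

    # Row-wise log-softmax, shifted by the row max for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The matched pair for row i sits on the diagonal.
    return -np.mean(np.diag(log_prob))


rng = np.random.default_rng(0)
points = to_unit_cube(rng.normal(size=(8, 3)))

# Correctly matched points should score a lower loss than mismatched ones.
loss_matched = matched_point_contrastive_loss(points, points)
loss_mismatched = matched_point_contrastive_loss(points, points[::-1])
```

In this toy setup the loss is lower when each point is paired with its true match than when the pairing is reversed, which is the behavior the abstract describes as "encouraging matched points to have close embeddings".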