Self-supervised Learning of Interpretable Keypoints from Unlabelled Videos

Tomas Jakab; Ankush Gupta; Hakan Bilen; Andrea Vedaldi

arXiv:1907.02055 · cs.CV · December 24, 2020


TL;DR

This paper introduces KeypointGAN, a self-supervised approach that learns interpretable object keypoints from unlabelled videos by analyzing frame differences and leveraging a dual geometric representation, achieving state-of-the-art results without labeled data.

Contribution

The method uniquely combines a dual geometric representation and empirical pose priors from unpaired data to learn pose recognition without annotated images.

Findings

1. Achieves state-of-the-art performance on pose recognition benchmarks.

2. Learns interpretable keypoints using only unlabelled videos.

3. Effectively disentangles pose from appearance.


Tables

Table 1: Human landmark detection (Simplified H3.6M). Comparison with state-of-the-art methods for human landmark detection on the Simplified Human3.6M dataset [78]. We report %-MSE normalised by image size for each activity.

fully supervised
Method	all	wait	pose	greet	direct	discuss	walk
Newell et al. [39]	2.16	1.88	1.92	2.15	1.62	1.88	2.21
self-supervised + supervised post-processing
Thewlis et al. [57]	7.51	7.54	8.56	7.26	6.47	7.93	5.40
Zhang et al. [78]	4.14	5.01	4.61	4.76	4.45	4.91	4.61
Lorenz et al. [35]	2.79	—	—	—	—	—	—
self-supervised (no post-processing)
KeypointGAN (ours)	2.73	2.66	2.27	2.73	2.35	2.35	4.00

Table 2: Human landmark detection (full H3.6M). Comparison on the Human3.6M test set with a supervised baseline, Newell et al. [39], and a self-supervised method [25]. We report the MSE in pixels [23]. Results for each activity are in the supplementary.

fully supervised
Method	Human3.6M
Newell et al. [39]	19.52
self-supervised + supervised post-processing
Jakab & Gupta et al. [25]	19.12
self-supervised (no post-processing)
KeypointGAN with 3DHP prior (ours)	18.94
KeypointGAN with H3.6M prior (ours)	14.46

Table 3: Facial landmark detection. Comparison with state-of-the-art methods on 2D facial landmark detection. We report the inter-ocular distance normalised keypoint localisation error [79] (in %; ↓ is better) on the 300-W test set. †: [59] evaluate using two different networks: (1) SmallNet, which we outperform; (2) HourGlass, which is not directly comparable due to its much larger capacity (4M vs 12M parameters).

fully supervised
Method	300-W
LBF [46]	6.32
CFSS [82]	5.76
cGPRT [32]	5.71
DDN [75]	5.65
TCDCN [79]	5.54
RAR [72]	4.94
Wing Loss [13]	4.04
self-supervised + supervised post-processing
Thewlis et al. [58]	9.30
Thewlis et al. [57]	7.97
Thewlis et al. [59] SmallNet ^†	5.75
Wiles et al. [71]	5.71
Jakab & Gupta et al. [25]	5.39
Thewlis et al. [59] HourGlass ^†	4.65
self-supervised
KeypointGAN (ours w/o post-processing)	8.67
+ supervised post-processing	5.12

Table 4: Ablation study. We start with the CycleGAN [81] model and sequentially augment it with (1) a conditional image generator ($\Psi$), (2) a skeleton bottleneck ($\beta\circ\eta$), and (3) removal of the second cycle-constraint, resulting in our proposed KeypointGAN model. An auto-encoding model with a skeleton image as the intermediate representation (i.e. no keypoint bottleneck) and an adversarial loss is also reported (last row). We report 2D landmark detection error (↓ is better) on Simplified Human3.6M (section 5.1) for human pose and on 300-W (section 5.2) for faces.

Method	humans	faces
CycleGAN	3.54	11.89
+ conditional generator (1)	3.60	–
+ skeleton-bottleneck (2)	2.86	9.64
$-$ $2^{nd}$ cycle = KeypointGAN (ours) (3)	2.73	8.67
CycleGAN $-$ $2^{nd}$ cycle	3.39	11.36

Table 5: Varying # of unpaired landmark samples. We train KeypointGAN using varying numbers of samples for the landmark prior. For faces, we sample the prior from the MultiPIE dataset and evaluate on 300-W (section 5.2). For human pose, we sample the prior from the disjoint part of the Simplified Human3.6M training set and evaluate on the test set (section 5.1). We report the keypoint localisation error ($\pm\sigma$) (in %; ↓ is better). The full dataset has 6k unpaired samples for faces and 400k for humans. Decreasing the number of unpaired landmark samples retains most of the performance.

# unpaired samples	humans (no post-proc.)	faces (no post-proc.)	faces (+ sup. post-proc.)
full dataset	$2.73$	$8.67$	$5.12$
5000	$2.92 \pm 0.05$	–	–
500	$3.30 \pm 0.06$	$8.91 \pm 0.15$	$5.22 \pm 0.04$
50	$4.05 \pm 0.02$	$8.92 \pm 0.20$	$5.19 \pm 0.06$

Equations

$$\bm{x}=\Psi(\Phi(\bm{x}),\bm{x}^{\prime}). \tag{1}$$

$$\beta(\bm{p})_{u}=\exp\Big(-\gamma\min_{(i,j)\in E,\ r\in[0,1]}\|u-rp_{i}-(1-r)p_{j}\|^{2}\Big). \tag{2}$$

$$\bm{x}=\Psi(\beta\circ\eta\circ\Phi(\bm{x}),\bm{x}^{\prime}). \tag{3}$$

$$\mathcal{L}_{\text{perc}}=\frac{1}{N}\sum_{i=1}^{N}\|\Gamma(\hat{\bm{x}}_{i})-\Gamma(\bm{x}_{i})\|_{2}^{2}, \tag{4}$$

$$\mathcal{L}_{\text{disc}}(D)=\frac{1}{M}\sum_{j=1}^{M}D(\bar{\bm{y}}_{j})^{2}+\frac{1}{N}\sum_{i=1}^{N}(1-D(\bm{y}_{i}))^{2}. \tag{5}$$

$$\mathcal{L}(\Phi,\Psi,D)=\lambda\,\mathcal{L}_{\text{disc}}(D,\Phi)+\mathcal{L}_{\text{perc}}(\Psi,\Phi), \tag{6}$$


Taxonomy

Methods: Batch Normalization · Residual Connection · PatchGAN · Tanh Activation · Residual Block · Instance Normalization · Convolution · Sigmoid Activation

Full text

Tomas Jakab

Visual Geometry Group

University of Oxford

Ankush Gupta

DeepMind, London

Hakan Bilen

School of Informatics

University of Edinburgh

Andrea Vedaldi

Visual Geometry Group

University of Oxford

Abstract

We propose KeypointGAN, a new method for recognizing the pose of objects from a single image that learns using only unlabelled videos and a weak empirical prior on the object poses. Video frames differ primarily in the pose of the objects they contain, so our method distils the pose information by analyzing the differences between frames. The distillation uses a new dual representation of the geometry of objects as a set of 2D keypoints, and as a pictorial representation, i.e. a skeleton image. This has three benefits: (1) it provides a tight ‘geometric bottleneck’ which disentangles pose from appearance, (2) it can leverage powerful image-to-image translation networks to map between photometry and geometry, and (3) it allows empirical pose priors to be incorporated in the learning process. The pose priors are obtained from unpaired data, such as a different dataset or a different modality such as mocap, so that no annotated image is ever used in learning the pose recognition network. In standard benchmarks for pose recognition for humans and faces, our method achieves state-of-the-art performance among methods that do not require any labelled images for training. Project page: http://www.robots.ox.ac.uk/~vgg/research/unsupervised_pose/

1 Introduction

Learning with limited or no external supervision is one of the most significant open challenges in machine learning. In this paper, we consider the problem of learning the 2D geometry of object categories such as humans and faces using raw videos and as little additional supervision as possible. In particular, given as input a number of videos centred on the object, the goal is to learn automatically a neural network that can predict the pose of the object from a single image.

Learning from unlabelled images requires a suitable supervisory signal. Recently, [25] noted that during a video an object usually maintains its intrinsic appearance but changes its pose. Hence, the concept of pose can be learned by modelling the differences between video frames. They formulate this as conditional image generation. They extract a small amount of information from a given target video frame via a tight bottleneck which retains pose information while discarding appearance. For supervision, they reconstruct the target frame from the extracted pose, similar to an auto-encoder. However, since pose alone does not contain sufficient information to reconstruct the appearance of the object, they also pass to the generator a second video frame from which the appearance can be observed.

In this paper, we also consider a conditional image generation approach, but we introduce a whole new design for the model and for the ‘pose bottleneck’. In particular, we adopt a dual representation of pose as a set of 2D object coordinates, and as a pictorial representation of the 2D coordinates in the form of a skeleton image. We also define a differentiable skeleton generator to map between the two representations.

This design is motivated by the fact that, by encoding pose labels as images we can leverage powerful image-to-image translation networks [81] to map between photometry and geometry. In fact, the two sides of the translation process, namely the input image and its skeleton, are spatially aligned, which is well known to simplify learning by a Convolutional Neural Network (CNN) [81]. At the same time, using 2D coordinates provides a very tight bottleneck that allows the model to efficiently separate pose from appearance.

The pose bottleneck is further controlled via a discriminator, learned adversarially. This has the advantage of injecting prior information about the possible object poses in the learning process. While acquiring this prior may require some supervision, this is separate from the unlabelled videos used to learn the pose recognizer — that is, our method is able to leverage unpaired supervision. In this way, our method outputs poses that are directly interpretable. We refer to our proposed method as KeypointGAN. By contrast, state-of-the-art self-supervised keypoint detectors [57, 25, 78, 71, 53] do not learn “semantic” keypoints and, in post-processing, they need at least some paired supervision to output human-interpretable keypoints. We highlight this difference in fig. 1.

Overall, we make three significant contributions:

1. We introduce a new conditional generator design combining image translation, a new bottleneck using a dual representation of pose, and an adversarial loss which significantly improve recognition performance.

2. We learn, for the first time, to directly predict human-interpretable landmarks without requiring any labelled images.

3. We obtain state-of-the-art unsupervised landmark detection performance even when compared against methods that use paired supervision in post-processing.

We test our approach using videos of people, faces, and cat images. On standard benchmarks such as Human3.6M [23] and 300-W [50], we achieve state-of-the-art pose recognition performance for methods that learn only from unlabelled images. We also probe generalization by testing whether the empirical pose prior can be extracted independently from the videos used to train the pose recognizer. We demonstrate this in two challenging scenarios. First, we use the mocap data from MPI-INF-3DHP [38] as prior and we learn a human pose recognizer on videos from Human3.6M. Second, we use the MultiPIE [54] dataset as prior to learn a face pose recognizer on VoxCeleb2 [10] videos, and achieve state-of-the-art facial keypoint detection performance on 300-W.

2 Related work

We consider pose recognition, intended as the problem of predicting the 2D pose of an object from a single image. Approaches to this problem must be compared in relation to (1) the type of supervision, and (2) which priors they use. There are three broad categories for supervision: full supervision when the training images are annotated with the same labels that one wishes to predict; weak supervision when the predicted labels are richer than the image annotations; and no supervision when there are no image annotations. For the prior, methods can use a prior model learned from any kind of data or supervision, an empirical prior, or no prior at all.

Based on this definition, our method is unsupervised and uses an empirical prior. Next, we relate our work to others, dividing them by the type of supervision used (our method falls in the last category).

Full supervision.

Several fully-supervised methods leverage large annotated datasets such as MS COCO Keypoints [33], Human3.6M [23], MPII [2] and LSP [27]. They generally do not use a separate prior as the annotations themselves capture one empirically. Some methods use pictorial structures [12] to model the object poses [1, 51, 74, 43, 40, 44]. Others use a CNN to directly regress keypoint coordinates [62], keypoint confidence maps [61], or other relations between keypoints [9]. Others again apply networks iteratively to refine heatmaps for single [70, 39, 3, 42, 8, 6, 60] and multi-person settings [22, 7]. Our method does not use any annotated image to learn the pose recognizer.

Weak supervision.

A typical weakly-supervised method is that of Kanazawa et al. [29]: they learn to predict dense 3D human meshes from sparse 2D keypoint annotations. They use two priors: the SMPL [34] parametric human mesh model, and a prior on 3D poses acquired via adversarial learning from mocap data. Analogous works include [64, 73, 49, 17, 16, 18, 52, 69].

All such methods use a prior trained on unpaired data, as we do. However, they also use additional paired annotations such as 2D keypoints or relative depth relations [49]. Furthermore, in most cases they use a fully-fledged 3D prior such as SMPL human [34] or Basel face [41] models, while we only use an empirical prior in the form of example 2D keypoints configurations.

No supervision.

Other methods use no supervision, and some use no data-driven prior either. The works of [28, 48, 71, 53] learn to match pairs of images of an object, but they do not learn geometric invariants such as keypoints. [57, 58, 59] do learn sparse and dense landmarks, also without any annotation. The method of [56] does not use image annotations, but uses instead synthetic views of 3D models as prior, which we do not require.

Some of these methods use conditional image generation as we do. Jakab & Gupta et al. [25], the most related, is described in the introduction. Zhang et al. [78] and Lorenz et al. [35] develop an auto-encoding formulation to discover landmarks as explicit structural representations for a given image and use them to reconstruct the original image. Wiles et al. [71] and Shu et al. [53] learn a dense deformation field for faces. Our method differs from those in the particular nature of the model and geometric bottleneck. Furthermore, due to our use of a prior, we are able to learn out-of-the-box landmarks that are ‘semantically meaningful’; by contrast, these approaches must rely on at least some paired supervision to translate between the unsupervised and ‘semantic’ landmarks. We also outperform these approaches in landmark detection quality.

Adversarial learning.

Our method is also related to adversarial learning, which has proven to be useful in image labelling [14, 21, 65, 66, 20] and generation [19, 81], including bridging the domain shift between real and generated images. Most relevant to our work, Isola et al. [24] propose an image-to-image translation framework using paired data, while CycleGAN [81] can do so with unpaired data. Our method also uses image-to-image translation networks, but compared to CycleGAN, our use of conditional image generation addresses the logical fallacy that an image-like label (a skeleton) does not contain sufficient information to generate a full image; this issue is discussed in depth in section 4.

Appearance and geometry factorization.

Recent methods for image generation conditioned on object attributes, like viewpoint [47], pose [63], and hierarchical latents [55], have been proposed. Our method allows for similar but more fine-grained conditional image generation, conditioned on an appearance image or object landmarks. Many unsupervised methods for pose estimation [25, 78, 35, 71, 53] share a similar ability. However, we can achieve more accurate and predictable image editing by manipulating semantic parts in the image through their corresponding landmarks.

3 Method

Our goal is to learn a network $\Phi:\bm{x}\mapsto\bm{y}$ that maps an image $\bm{x}$ containing an object to its pose $\bm{y}$ . To avoid having to use image annotations, the network is trained using an auto-encoder formulation. Namely, given the pose $\bm{y}=\Phi(\bm{x})$ extracted from the image, we train a decoder network $\Psi$ that reconstructs the image from the pose. However, since pose lacks appearance information, this reconstruction task is ill posed. Hence, we also provide the decoder with a different image $\bm{x}^{\prime}$ of the same object to convey its appearance. Formally, the image $\bm{x}$ is reconstructed from the pose $\bm{y}$ and the auxiliary image $\bm{x}^{\prime}$ via a conditional decoder network

$$\bm{x}=\Psi(\Phi(\bm{x}),\bm{x}^{\prime}). \tag{1}$$

Unfortunately, without additional constraints, this formulation fails to learn pose properly [25]. The reason is that, given enough freedom, the encoder $\Phi(\bm{x})$ may simply output a copy of the input image $\bm{x}$, which trivially satisfies constraint (1) without learning anything useful (this issue is visualized in section 4). The formulation needs a mechanism to force the encoder $\Phi$ to ‘distil’ only pose information and discard appearance.

We make two key contributions to address this issue. First, we introduce a dual representation of pose as a vector of 2D keypoint coordinates and as a pictorial representation in the form of a ‘skeleton’ image (section 3.1). We show that this dual representation provides a tight bottleneck that distils pose information effectively while making it possible to implement the auto-encoder (1) using powerful image-to-image translation networks.

Our second contribution is to introduce an empirical prior on the possible object poses (section 3.2). In this manner, we can constrain not just the individual pose samples $\bm{y}$, but their distribution $p(\bm{y})$ as well. In practice, the prior allows unpaired pose samples to be used to improve accuracy and to learn a human-interpretable notion of pose that does not necessitate further learning to be used in applications.

3.1 Dual representation of pose & bottleneck

We consider a dual representation of the pose of an object as a vector of $K$ 2D keypoint coordinates $\bm{p}=(p_{1},\dots,p_{K})\in\Omega^{K}$ and as an image $\bm{y}\in\mathbb{R}^{\Omega}$ containing a pictorial rendition of the pose as a skeleton (see fig. 2 for an illustration). Here the symbol $\Omega=\{1,\dots,H\}\times\{1,\dots,W\}$ denotes a grid of pixel coordinates.

Representing pose as a set of 2D keypoints provides a tight bottleneck that preserves geometry but discards appearance information. Representing pose as a skeleton image allows the encoder and decoder networks to be implemented as image translation networks. In particular, the image of the object $\bm{x}$ and the image of its skeleton $\bm{y}$ are spatially aligned, which makes it easier for a CNN to map between them.

Next, we show how to switch between the two representations of pose. We define the mapping $\bm{y}=\beta(\bm{p})$ from the coordinates $\bm{p}$ to the skeleton image $\bm{y}$ analytically. Let $E$ be the set of keypoint pairs $(i,j)$ connected by a skeleton edge and let $u\in\Omega$ be an image pixel. Then the skeleton image is given by:

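$$\beta(\bm{p})_{u}=\exp\Big(-\gamma\min_{(i,j)\in E,\ r\in[0,1]}\|u-rp_{i}-(1-r)p_{j}\|^{2}\Big). \tag{2}$$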

The differentiable function $\bm{y}=\beta(\bm{p})$ defines a distance field from line segments that form the skeleton and applies an exponential fall off to generate an image. The visual effect is to produce a smooth line drawing of the skeleton. We also train an inverse function $\bm{p}=\eta(\bm{y})$ , implementing it as a neural network regressor (see supplementary for details).
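As a rough illustration of the skeleton renderer described above, the following NumPy sketch computes, for every pixel, the squared distance to the nearest skeleton edge and applies an exponential fall-off. The function name `render_skeleton`, the zero-based (row, column) pixel convention, and the default value of `gamma` are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def render_skeleton(p, edges, H, W, gamma=0.05):
    """Sketch of beta(p): exp(-gamma * min squared distance to any skeleton edge).

    p     : (K, 2) keypoints as (row, col); zero-based here (an assumption).
    edges : iterable of (i, j) index pairs connected by a skeleton edge.
    """
    p = np.asarray(p, dtype=np.float64)
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    u = np.stack([rows, cols], axis=-1).astype(np.float64)  # (H, W, 2) pixel grid
    min_d2 = np.full((H, W), np.inf)
    for i, j in edges:
        a, b = p[j], p[i]          # points on the edge: r*p_i + (1-r)*p_j = a + r*(b - a)
        ab = b - a
        denom = ab @ ab
        if denom == 0.0:           # degenerate edge: both endpoints coincide
            r = np.zeros((H, W))
        else:                      # closed-form minimiser over r, clipped to [0, 1]
            r = np.clip(((u - a) @ ab) / denom, 0.0, 1.0)
        closest = a + r[..., None] * ab
        d2 = np.sum((u - closest) ** 2, axis=-1)
        min_d2 = np.minimum(min_d2, d2)
    return np.exp(-gamma * min_d2)  # values in (0, 1], equal to 1 on the skeleton

# Hypothetical 3-keypoint chain rendered on a 32x32 grid.
y = render_skeleton([[4, 4], [16, 16], [28, 10]], [(0, 1), (1, 2)], 32, 32)
```

Written in an automatic-differentiation framework, the same operations (projection, clipping, minimum, exponential) admit gradients with respect to the keypoints almost everywhere, which is what makes such a renderer usable inside a trained model.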

Given the two maps $(\eta,\beta)$ , we can use either representation of pose, as needed. In particular, by using the pictorial representation $\bm{y}$ , the encoder/pose recogniser can be written as an image-to-image translation network $\Phi:\bm{x}\mapsto\bm{y}$ whose input $\bm{x}\in\mathbb{R}^{3\times H\times W}$ and output $\bm{y}$ are both images. The same is true for the conditional decoder $\Psi:(\bm{y},\bm{x}^{\prime})\mapsto\bm{x}$ of eq. 1.

While image-to-image translation is desirable architecturally, the downside of encoding pose as an image $\bm{y}$ is that it gives the encoder $\Phi$ an opportunity to ‘cheat’ and inject appearance information in the pose representation $\bm{y}$. We can prevent cheating by exploiting the coordinate representation of pose to filter out any hidden appearance information from $\bm{y}$. We do so by converting the pose image into keypoints and then back. This amounts to substituting $\bm{y}=\beta\circ\eta(\bm{y})$ in eq. 1, which yields the modified auto-encoding constraint:

$$\bm{x}=\Psi(\beta\circ\eta\circ\Phi(\bm{x}),\bm{x}^{\prime}). \tag{3}$$

3.2 Learning formulation & pose prior

Auto-encoding loss.

In order to learn the auto-encoder (3), we use a dataset of $N$ example pairs of video frames $\{(\bm{x}_{i},\bm{x}_{i}^{\prime})\}_{i=1}^{N}$ . Then the auto-encoding constraint (3) is enforced by optimizing a reconstruction loss. Here we use a perceptual loss:

$$\mathcal{L}_{\text{perc}}=\frac{1}{N}\sum_{i=1}^{N}\|\Gamma(\hat{\bm{x}}_{i})-\Gamma(\bm{x}_{i})\|_{2}^{2}, \tag{4}$$

where $\hat{\bm{x}}_{i}=\Psi(\beta\circ\eta\circ\Phi(\bm{x}_{i}),\bm{x}_{i}^{\prime})$ is the reconstructed image and $\Gamma$ is a feature extractor. Instead of comparing pixels directly, the perceptual loss compares features extracted from a standard network such as VGG [26, 11, 15, 5], and leads to more robust training.
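To make the perceptual loss concrete, here is a minimal NumPy sketch that averages the squared L2 distance between features of reconstructed and target images over a batch. The feature extractor `toy_features` (finite differences of pixel values) is a hypothetical stand-in chosen only to keep the example self-contained; the paper uses features from a standard network such as VGG, which is not reproduced here.

```python
import numpy as np

def perceptual_loss(x_hat, x, feature_fn):
    """(1/N) * sum_i || Gamma(x_hat_i) - Gamma(x_i) ||_2^2 for a batch of N images."""
    f_hat = np.stack([feature_fn(img).ravel() for img in x_hat])
    f = np.stack([feature_fn(img).ravel() for img in x])
    return float(np.mean(np.sum((f_hat - f) ** 2, axis=1)))

def toy_features(img):
    """Hypothetical stand-in for Gamma: horizontal and vertical finite differences."""
    dx = img[:, 1:] - img[:, :-1]
    dy = img[1:, :] - img[:-1, :]
    return np.concatenate([dx.ravel(), dy.ravel()])

rng = np.random.default_rng(0)
targets = rng.random((4, 8, 8))                          # N = 4 grey-scale images
recons = targets + 0.1 * rng.standard_normal((4, 8, 8))  # noisy "reconstructions"
loss = perceptual_loss(recons, targets, toy_features)
```

With this toy extractor, a constant brightness shift of the reconstruction leaves the loss essentially unchanged, which hints at why comparing features can be more forgiving than comparing raw pixels.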

Pose prior.

In addition to the $N$ training image pairs $\{(\bm{x}_{i},\bm{x}_{i}^{\prime})\}_{i=1}^{N}$, we also assume access to $M$ sample poses $\{\bar{\bm{p}}_{j}\}_{j=1}^{M}$. Importantly, these sample poses are unpaired, in the sense that they are not annotations of the training images.

We use the unpaired pose samples to encourage the predicted poses $\bm{y}$ to be plausible. This is obtained by matching two distributions. The reference distribution $q(\bm{y})$ is given by the unpaired pose samples $\{\bar{\bm{y}}_{j}=\beta(\bar{\bm{p}}_{j})\}_{j=1}^{M}$ . The other distribution $p(\bm{y})$ is given by the pose samples $\{\bm{y}_{i}=\Phi(\bm{x}_{i})\}_{i=1}^{N}$ predicted by the learned encoder network from the example video frames $\bm{x}_{i}$ .

The goal is to match $p(\bm{y})\approx q(\bm{y})$ in a distributional sense. This can be done by learning a discriminator network $D(\bm{y})\in[0,1]$ whose purpose is to discriminate between the unpaired samples $\bar{\bm{y}}_{j}=\beta(\bar{\bm{p}}_{j})$ and the predicted samples $\bm{y}_{i}=\Phi(\bm{x}_{i})$ . Samples are compared by means of the difference adversarial loss of [37]:

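$$\mathcal{L}_{\text{disc}}(D)=\frac{1}{M}\sum_{j=1}^{M}D(\bar{\bm{y}}_{j})^{2}+\frac{1}{N}\sum_{i=1}^{N}(1-D(\bm{y}_{i}))^{2}. \tag{5}$$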

In addition to capturing plausible poses, the pose discriminator $D(\bm{y})$ also encourages the images $\bm{y}$ to be ‘skeleton-like’. The effect is thus similar to the bottleneck introduced in section 3.1 and one may wonder if the discriminator makes the bottleneck redundant. The answer, as shown in sections 4 and 5, is negative: both are needed.
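The discriminator term can be sketched numerically as follows: it is the mean of squared discriminator scores on the unpaired prior skeletons plus the mean of squared (1 minus score) on the predicted skeletons. The function name `disc_loss` and the example score arrays are assumptions for illustration; the discriminator network itself is not modelled, only the loss on its outputs in $[0,1]$.

```python
import numpy as np

def disc_loss(d_prior, d_pred):
    """mean_j D(y_bar_j)^2 + mean_i (1 - D(y_i))^2 on discriminator scores in [0, 1].

    d_prior : scores D(y_bar_j) on skeletons rendered from unpaired prior poses.
    d_pred  : scores D(y_i) on skeletons predicted by the encoder.
    """
    d_prior = np.asarray(d_prior, dtype=np.float64)
    d_pred = np.asarray(d_pred, dtype=np.float64)
    return float(np.mean(d_prior ** 2) + np.mean((1.0 - d_pred) ** 2))

# A discriminator that separates the two sets perfectly (prior -> 1, predicted -> 0)
# attains the largest possible value of this term.
separated = disc_loss([1.0, 1.0], [0.0, 0.0])
# An encoder that fools it (predicted -> 1) lowers the second summand to zero.
fooled = disc_loss([1.0, 1.0], [1.0, 1.0])
```

Under the min-max convention stated in the text (the discriminator maximises, the encoder minimises), the discriminator is pushed to score prior samples near 1 and predictions near 0, while the encoder is pushed to produce skeletons that score near 1, i.e. that resemble the prior samples.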

Overall learning formulation.

Combining losses (4) and (5) yields the overall objective:

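$$\mathcal{L}(\Phi,\Psi,D)=\lambda\,\mathcal{L}_{\text{disc}}(D,\Phi)+\mathcal{L}_{\text{perc}}(\Psi,\Phi), \tag{6}$$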

where $\lambda$ is a loss-balancing factor. The components of this model, KeypointGAN, and their relations are illustrated in fig. 2. Similar to any adversarial formulation, eq. 6 is minimized w.r.t. $\Phi,\Psi$ and maximised w.r.t. $D$ .

Details.

The functions $\Phi$ , $\Psi$ , $\eta$ and $D$ are implemented as convolutional neural networks. The auto-encoder functions $\Phi$ and $\Psi$ and the discriminator $D$ are trained by optimizing the objective in eq. 6 ( $\eta$ is pre-trained using unpaired landmarks, for details see supplementary). Batches are formed by sampling random pairs of video frames $(\bm{x}_{i},\bm{x}_{i}^{\prime})$ and unpaired pose $\bar{\bm{y}}_{j}$ samples. When sampling from image datasets (instead of videos), we generate image pairs as $(g_{1}(\bm{x}_{i}),g_{2}(\bm{x}_{i}))$ by applying random thin-plate-splines $g_{1},g_{2}$ to training samples $\bm{x}_{i}$ . All the networks are trained from scratch. Architectures and training details are in the supplementary.

4 Relation to image-to-image translation

Our method is related to unpaired image-to-image translation, of which CycleGAN [81] is perhaps the best example, but with two key differences: (a) it has a bottleneck (section 3.1) that prevents leaking appearance information into the pose representation $\bm{y}$ , and (b) it reconstructs the image $\bm{x}$ conditioned on a second image $\bm{x}^{\prime}$ . We show in the experiments that these changes are critical for pose recognition performance, and conduct a further analysis here.

First, consider what happens if we drop both changes (a) and (b), thus making our formulation more similar to CycleGAN. In this case, eq. 1 reduces to $\bm{x}=\Psi(\Phi(\bm{x}))$. The trivial solution of setting both $\Phi$ and $\Psi$ to the identity function is only avoided due to the discriminator loss (5), which encourages $\bm{y}=\Phi(\bm{x})$ to look like a skeleton (rather than a copy of $\bm{x}$). In theory, then, this problem should be ill-posed, as the pose $\bm{y}$ should not have sufficient information to recover the input image $\bm{x}$. However, the reconstructions from such a network still look reasonably good (see fig. 4). A closer look at the logarithm of the generated skeleton $\bm{y}$ reveals that CycleGAN ‘cheats’ by leaking appearance information via subtle patterns in $\bm{y}$. By contrast, our bottleneck significantly limits leaking appearance in the pose image and thus its ability to reconstruct $\bm{x}=\Psi(\beta\circ\eta\circ\Phi(\bm{x}))$ from a single image; instead, reconstruction is achieved by injecting the missing appearance information via the auxiliary image $\bm{x}^{\prime}$ using a conditional image decoder (eq. 3).

5 Experiments

We evaluate our method, KeypointGAN, on the task of 2D landmark detection for human pose (section 5.1), faces (section 5.2), and cat heads (section 5.3) and outperform state-of-the-art methods (tables 1, 2 and 3) on these tasks. We examine the relative contributions of components of our model in an ablation study (section 5.4). We study the effect of reducing the number of pose samples used in the empirical prior (section 5.5). Finally, we demonstrate image generation and manipulation conditioned on appearance and pose (section 5.6).

Evaluation.

KeypointGAN directly outputs predictions for keypoints that are human-interpretable. In contrast, self-supervised methods [58, 57, 71, 59, 78, 35, 25] predict only machine-interpretable keypoints, as illustrated in fig. 1, and require at least some example images with paired keypoint annotations in order to learn to convert these landmarks to human-interpretable ones for benchmarking or for applications. We call this step supervised post-processing. KeypointGAN does not require this step, but we also include this result for a direct comparison with previous methods.

5.1 Human pose

Datasets.

Simplified Human3.6M, introduced by Zhang et al. [78] for evaluating unsupervised pose recognition, contains 6 activities in which human bodies are mostly upright; it comprises 800k training and 90k testing images. Human3.6M [23] is a large-scale dataset that contains 3.6M accurate 2D and 3D human pose annotations for 17 different activities, imaged under 4 viewpoints and a static background. For training, we use subjects 1, 5, 6, 7, and 8, and for evaluation subjects 9 and 11, as in [68]. PennAction [77] contains 2k challenging consumer videos of 15 sports categories. MPI-INF-3DHP [38] is a mocap dataset containing 8 subjects performing 8 activities in complex exercise poses. There are 28 joints annotated.

We split datasets into two disjoint parts for sampling image pairs $(\bm{x},\bm{x}^{\prime})$ (cropped to the provided bounding boxes), and skeleton prior respectively to ensure that the pose data does not contain labels corresponding to the training images. For the Human3.6M datasets we split the videos in half, while for PennAction we split in half the set of videos from each action category. We also evaluate the case when images and skeletons are sampled from different datasets and for this purpose we use the MPI-INF-3DHP mocap data.

Evaluation.

We report 2D landmark detection performance on the simplified and original Human3.6M datasets. For Simplified Human3.6M, we follow the standard protocol of [78] and report the error for all 32 joints normalized by the image size. For Human3.6M, we instead report the mean error in pixels over 17 of the 32 joints [23]. To demonstrate learning from unpaired prior, we consider two settings for sourcing the images and the prior. In the first setting, we use different datasets for the two, and sample images from Human3.6M and poses from MPI-INF-3DHP. In the second setting, we use instead two disjoint parts of the same dataset Human3.6M for both images and poses. When using MPI-INF-3DHP dataset as the prior, we predict 28 joints, but use 17 joints that are common with Human3.6M for evaluation. We train KeypointGAN from scratch and compare its performance with both supervised and unsupervised methods.
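The two reporting conventions above (error in pixels over a subset of joints, and error normalised by image size) can be sketched with one helper. This is one plausible reading, assuming the error is the mean Euclidean distance between predicted and ground-truth keypoints; the exact definitions follow [78] and [23] and may differ in detail. The name `mean_keypoint_error` and its parameters are hypothetical.

```python
import numpy as np

def mean_keypoint_error(pred, gt, joints=None, image_size=None):
    """Mean Euclidean keypoint error over a batch (assumed metric, see lead-in).

    pred, gt   : (N, K, 2) arrays of predicted / ground-truth keypoints.
    joints     : optional index list selecting the joints to evaluate,
                 e.g. the subset shared between two skeleton definitions.
    image_size : if given, the error is returned in percent of this size.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if joints is not None:
        pred, gt = pred[:, joints], gt[:, joints]
    dist = np.linalg.norm(pred - gt, axis=-1)   # (N, K') per-joint distances
    if image_size is not None:
        dist = 100.0 * dist / image_size
    return float(dist.mean())

# Hypothetical example: 2 images, 5 joints, every prediction off by (3, 4) pixels.
gt = np.zeros((2, 5, 2))
pred = gt + np.array([3.0, 4.0])
err_px = mean_keypoint_error(pred, gt, joints=[0, 2, 4])
err_pct = mean_keypoint_error(pred, gt, image_size=128)
```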

Results.

Table 1 reports the results on Simplified Human3.6M. As in previous self-supervised works [78, 57], we compare against the supervised baseline by Newell et al. [39]. Our model outperforms all the baselines [35, 78, 57] without the supervised post-processing used by the others.

Table 2 summarises our results on the original Human3.6M test set. Here we also compare against the supervised baseline [39] and the self-supervised method of [25]. Our model outperforms the baselines in this test too.

It may be surprising that KeypointGAN outperforms the supervised baseline. A possible reason is the limited number of supervised examples, which causes the supervised baseline to overfit. This can be noted by comparing the training / test errors: 14.61 / 19.52 for supervised hourglass and 13.79 / 14.46 for our method.

When poses are sampled from a different dataset (MPI-INF-3DHP) than the images (Human3.6M), the error is higher at 18.94 (but still better than the supervised alternative). This increase is due to the domain gap between the two datasets. Figure 5 shows some qualitative examples. Limitations of KeypointGAN are highlighted in fig. 6.

5.2 Human faces

Datasets.

VoxCeleb2 [10] is a large-scale dataset consisting of 1M short clips of talking-head videos extracted from YouTube. MultiPIE [54] contains 68 labelled facial landmarks and 6k samples. We use this dataset as the only source for the prior. 300-W [50] is a challenging dataset of facial images obtained by combining multiple datasets [4, 80, 45] as described in [57, 46]. As in MultiPIE, 300-W contains 68 annotated facial landmarks. We use 300-W as our test dataset and follow the evaluation protocol in [46].

Results.

As for human pose, we study a scenario where images and poses are sourced from different datasets, using VoxCeleb2 and 300-W for the images, and MultiPIE (6k samples) for the poses (fig. 7). We train KeypointGAN from scratch using video frames from VoxCeleb2; then we fine-tune the model using our unsupervised method on the 300-W training images. We report performance on the 300-W test set in table 3. KeypointGAN performs well even without any supervised fine-tuning on the target 300-W, and it already outperforms the unsupervised method of [58]. Adding supervised post-processing (on the 300-W training set), as done in all self-supervised learning methods [58, 57, 71, 59], we outperform all of them except [59] when they use their HG network, which has 3 times more learnable parameters (4M vs 12M parameters). Interestingly, we also outperform all supervised methods except [72, 13].

5.3 Cat heads

The Cat Head dataset [76] contains 9k images of cat heads, each annotated with 7 landmarks. We use the same train and test split as [78]. We split the training set into two equally sized parts with no overlap. The first part is used to sample training images and the second for the landmark prior. Our predictions are visualized in fig. 8.
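A disjoint split of this kind can be sketched as follows. The helper `split_disjoint_halves`, the random shuffle, and the fixed seed are assumptions for illustration; the text does not state how the two halves were chosen.

```python
import random

def split_disjoint_halves(items, seed=0):
    """Split items into two non-overlapping halves.

    The first half would supply training images and the second half
    the landmark annotations used as the unpaired prior. Shuffling
    with a fixed seed is an assumption, not taken from the paper.
    """
    items = list(items)
    rng = random.Random(seed)
    rng.shuffle(items)
    mid = len(items) // 2
    return items[:mid], items[mid:]
```

Because no sample appears in both halves, the images never come with their own landmarks, so the prior stays unpaired.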

5.4 Ablation study

As noted above, we can obtain our method by making the following changes to CycleGAN: (1) switching to a conditional image generator $\Psi$ , (2) introducing the skeleton bottleneck $\beta\circ\eta$ , and (3) removing the “second auto-encoder cycle” for the other domain (in our case the skeleton images). Table 4 shows the effect of modifying CycleGAN in this manner on Simplified Human3.6M [78] for humans and on 300-W [50] for faces.

The baseline CycleGAN can be thought of as learning a mapping between images and skeletons via off-the-shelf image translation. Switching to a conditional image generator (1) does not improve the results because the model can still leak appearance information. However, introducing the bottleneck (2) improves performance significantly for both humans ( $2.86\%$ vs. $3.54\%$ CycleGAN, roughly a 20% error reduction) and faces ( $9.64\%$ vs. $11.89\%$ CycleGAN, a 19% error reduction). This also justifies the use of a conditional generator, as the model fails to converge if the bottleneck is used without it. Removing the second cycle (3) leads to further improvements, showing that this part is detrimental for our task.
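The quoted error reductions can be checked with a few lines of arithmetic. The helper `relative_error_reduction` is introduced only for this illustration. For faces, the sketch takes $11.89\%$ as the CycleGAN error and $9.64\%$ as the error with the bottleneck, which is the assignment consistent with a 19% reduction.

```python
def relative_error_reduction(baseline_error, new_error):
    """Fraction by which new_error improves on baseline_error."""
    return (baseline_error - new_error) / baseline_error

# Errors in percent, as quoted in the ablation discussion.
human = relative_error_reduction(baseline_error=3.54, new_error=2.86)
face = relative_error_reduction(baseline_error=11.89, new_error=9.64)

print(f"humans: {100 * human:.1f}%")  # humans: 19.2%
print(f"faces:  {100 * face:.1f}%")   # faces:  18.9%
```

Both reductions come out at about 19%, so the human figure is slightly below the 20% quoted in the text.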

5.5 Unpaired sample efficiency

Table 5 demonstrates that KeypointGAN retains state-of-the-art performance even when we use only 50 unpaired landmark samples for the empirical prior. The experiment was done following the same protocols for training on face and human datasets as described previously.

5.6 Appearance and geometry factorization

The conditional image generator $\Psi:(\bm{y}^{*},\bm{x}^{\prime})\mapsto\hat{\bm{x}}$ of eq. 1 can also be used to produce novel images by combining pose and appearance from different images. Figure 9 shows that the model can be used to transfer the appearance of a human face identity on top of the pose of another. Though generating high quality images is not our primary goal, the ability to transfer appearance shows that KeypointGAN properly factorizes the latter from pose.

This also demonstrates significant generalization beyond the training setting: the system only learns from pairs of frames sampled from the same video, and thus with the same identity, yet it can swap different identities. In fig. 10, we further leverage the disentanglement of geometry and appearance to manipulate a face by editing its keypoints.

6 Conclusion

We have shown that combining conditional image generation with a dual representation of pose and a tight geometric bottleneck makes it possible to learn to recognize the pose of complex objects such as humans without providing any labelled image to the system. To do so, KeypointGAN makes use of an unpaired pose prior, which also allows it to output human-interpretable pose parameters. With this, we have achieved state-of-the-art landmark detection accuracy among methods that do not use labelled images for training.

Acknowledgements.

We are grateful for the support of ERC 638009-IDIU, and the Clarendon Fund Scholarship. We would like to thank Triantafyllos Afouras, Relja Arandjelović, and Chuhan Zhang for helpful advice.

Bibliography

[1] Mykhaylo Andriluka, Stefan Roth, and Bernt Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In Proc. CVPR, pages 1014–1021. IEEE, 2009.
[2] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In Proc. CVPR, pages 3686–3693, 2014.
[3] Vasileios Belagiannis and Andrew Zisserman. Recurrent human pose estimation. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pages 468–475. IEEE, 2017.
[4] Peter N Belhumeur, David W Jacobs, David J Kriegman, and Neeraj Kumar. Localizing parts of faces using a consensus of exemplars. TPAMI, 35(12):2930–2940, 2013.
[5] Joan Bruna, Pablo Sprechmann, and Yann Le Cun. Super-resolution with deep convolutional sufficient statistics. In Proc. ICLR, 2016.
[6] Adrian Bulat and Georgios Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In Proc. ECCV, pages 717–732. Springer, 2016.
[7] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proc. CVPR, pages 7291–7299, 2017.
[8] Joao Carreira, Pulkit Agrawal, Katerina Fragkiadaki, and Jitendra Malik. Human pose estimation with iterative error feedback. In Proc. CVPR, pages 4733–4742, 2016.