Discriminate-and-Rectify Encoders: Learning from Image Transformation Sets
Andrea Tacchetti, Stephen Voinea, Georgios Evangelopoulos

The Center for Brains, Minds and Machines, MIT; McGovern Institute for Brain Research at MIT
{atacchet, voinea, gevang}@mit.edu
Andrea Tacchetti and Stephen Voinea contributed equally.
Abstract
The complexity of a learning task is increased by transformations in the input space that preserve class identity. Visual object recognition for example is affected by changes in viewpoint, scale, illumination or planar transformations. While drastically altering the visual appearance, these changes are orthogonal to recognition and should not be reflected in the representation or feature encoding used for learning. We introduce a framework for weakly supervised learning of image embeddings that are robust to transformations and selective to the class distribution, using sets of transforming examples (orbit sets), deep parametrizations and a novel orbit-based loss. The proposed loss combines a discriminative, contrastive part for orbits with a reconstruction error that learns to rectify orbit transformations. The learned embeddings are evaluated in distance metric-based tasks, such as one-shot classification under geometric transformations, as well as face verification and retrieval under more realistic visual variability. Our results suggest that orbit sets, suitably computed or observed, can be used for efficient, weakly-supervised learning of semantically relevant image embeddings.
1 Introduction
The distribution of examples for a learning problem, such as visual object recognition, will exhibit variability across and within semantic categories. The former is due to the category-specific statistics; the latter is due to the variety of instances that share the same semantics and to transformations that preserve the identity, such as geometric or photometric changes. Such transformations will alter the properties of the visual scene but will not change the semantic category of an object. Recognition across novel views (position, size, pose), clutter and occlusions [8, 26, 19], and generalization to new examples from a category, are hallmarks of human and primate perception. Invariance to transformations has been consistently explored as the computational objective of representations for computer vision and machine learning [12, 13, 5, 20, 31, 7].
Representations that facilitate generalization in downstream supervised tasks can be learned in unsupervised or semi-supervised settings [1], where the distribution of observations is used for obtaining non-linear similarity metrics, reducing the dimensionality by disregarding nuisance directions or deriving interpretable, generative models. Unsupervised learning has been used, for example, for pre-training of neural networks with the goal of improving the convergence rates of end-to-end learning algorithms.
A question of theoretical and practical interest is under which conditions representations learned without explicit supervision [4] can match the performance of supervised learning methods that implicitly account for the representation of the data [18]. Learning from unlabeled (or implicitly labeled) data can be an alternative to using labeled examples for training multi-parameter deep neural networks, and building large-scale, generic and transferable learning systems economically. In addition, biological and cognitive learning paradigms, particularly in perceptual domains such as vision or speech, predict learning and generalization from a small number of labeled examples and an abundance of implicitly labeled observations or weak supervision.
A natural source of weak supervision is the formation of equivalence relations and classes in the input space that are not necessarily related to the learning task. Such relations can be, for example, temporal, categorical or generative and partition the space into sets which we will loosely refer to as orbits in this paper. Examples of orbits are the set of images of an object under rotations [1, 12] or the frames of a video of a moving object [25, 43, 24, 37]. This partition by orbit sets, in terms of granularity, lies in-between single-example and task-specific, semantic class partitions.
In this paper, we propose using orbit sets, defined by generic transformations, as weak supervision for learning representations in an invariant metric space. Two points are equivalent if one can be related to the other through a transformation, and the set of all equivalent points forms an orbit. Orbits are either explicitly generated (data augmentation) or implicitly specified (temporal continuity, data acquisition or association). As opposed to inference using an explicit transformation model [16] or factoring out the nuisance using explicit pooling [1], we learn deep, parametric embeddings using a novel loss function that incorporates the orbit equivalence relations.
The proposed orbit metric loss generalizes the triplet loss and denoising autoencoders to orbit sets, that promote approximate invariance and reconstruction. We separately study the two motivating special cases: the orbit triplet loss, a discriminative term which implicitly promotes invariance and selectivity with respect to examples drawn from the same or different orbits respectively, and the orbit encoder loss, a generative term, which learns to rectify or de-transform by mapping orbit points to a single, canonical element. The learned embeddings are compared, under the same parametrizations, to those from the supervised triplet loss [29], the surrogate class loss [10] and, when a full model of the orbit-generating process is available (e.g. affine transformations), spatial transformer networks [16].
The learned embeddings define robust metrics that are semantically relevant for distance-based and low supervised-sample regime tasks, such as ranking, retrieval, matching, clustering, graph-construction and relational learning. We provide quantitative comparisons on face verification and retrieval (on Multi-PIE dataset) and one-shot learning for classification (on MNIST with affine transformations). Our results show that partitioning the input space according to suitable orbit sets is a powerful weak supervision cue, which the proposed encoding loss can exploit effectively to learn semantically relevant embeddings.
2 Related work
Invariance to transformations that are orthogonal to the learning task has been the subject of extensive theoretical and empirical investigation in artificial and biological perception and recognition. A number of studies focused on theoretical insights on the trade-off between invariance and selectivity through sufficient statistics [31], the optimality of explicit parametrizations with convolutions/pooling and memory-based learning [1, 2], as well as constructing invariants for compact groups and maps that are robust, rather than invariant, to diffeomorphisms [22]. These inspired feature extraction architectures for object recognition [9], texture classification [5], face verification [21], action recognition [33], and speech recognition [36, 41].
In this paper, we rely on the theoretical framework in [1, 2]. While relaxing some of the assumptions, e.g. compact groups, exact invariance, we make use of generic orbit sets, that can come from implicit supervision [1], or include non-group transformations, partial orbits and noisy samples. We propose a loss function that is a proxy for a margin-based invariance and selectivity in the representation, which can be used for end-to-end trainable encoders and does not rely on a particular parametrization, nor does it require access to a model of the orbit generation process.
Side information as a form of weak supervision has been employed for various distance metric learning algorithms [39]; for example, variants of the triplet loss function which rely on knowing which of two pairs corresponds to similar samples [38, 32]. Supervised versions of the triplet loss have been used for discriminatively-trained metric learning, through convolutional network parametrization, aiming to minimize the true objective of the task (e.g., face verification) [6, 29]. The triplet loss was also used for nonlinear dimensionality reduction and learning transformation-invariant embeddings [12]. Similar to the neighborhood graphs in [12], the orbits in our work can be obtained by side information or temporal proximity and are assumed known only for training.
Representation learning through surrogate classes populated by data augmentation transformations was explored in the exemplar CNN framework [10]. Autoencoder networks [14, 3, 4] have been typically used for dimensionality reduction (bottleneck features) and unsupervised pre-training of deep networks. Different reconstruction requirements or regularization terms have led to useful encodings by learning to perform denoising [35], sparse coding, contractive approximations for robustness [27], or to respect temporal continuity (feature slowness) [43, 25]. Convolutional autoencoders enforce local spatial robustness through max pooling [42, 23]. Our method uses a combination of triplet loss and an explicit rectification accuracy loss through an encoder-decoder network.
The idea of using a representation of the transformations for robust embeddings has been explored through explicit estimation of the parameters of an exact model of the generative process in spatial transformer networks [16], estimation of a latent representation in transforming autoencoders [13], or a distributed representation [42]. Similar to [28], the proposed orbit loss functions can be used as a regularizer of the discriminative loss of a deep network for representations that are both robust and discriminatively trained.
3 Background
We begin by reviewing relevant background concepts in order to provide context and formulate clearly the proposed losses and weakly-supervised learning methods.
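Learning and representations The feature map or data representation $\Phi: \mathcal{X} \to \mathcal{F}$ for a learning problem on input space $\mathcal{X}$ can be explicitly selected or learned using principles such as
- distance preservation or contraction, $\|\Phi(x) - \Phi(x')\|_{\mathcal{F}} \le \|x - x'\|_{\mathcal{X}}$,
- reconstruction, $\tilde{\Phi} \circ \Phi(x) = x$ for some map $\tilde{\Phi}: \mathcal{F} \to \mathcal{X}$,
- invariance and selectivity, $x \sim x' \Leftrightarrow \Phi(x) = \Phi(x')$,

where $\sim$ denotes equivalence and $\circ$ function composition, i.e. $\tilde{\Phi} \circ \Phi(x) = \tilde{\Phi}(\Phi(x))$.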
Remark 1.
The feature map $\Phi$ is selected from some hypothesis space of functions $\mathcal{H} \subseteq \{\Phi \mid \Phi: \mathcal{X} \to \mathcal{F}\}$. In practice, some parametrization of the elements of $\mathcal{H}$ that renders the resulting representation learning problem tractable is necessary.
Remark 2.
For kernel machines (or shallow networks), $\Phi$ is preselected, or implicitly induced by a kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$. For deep networks, $\Phi$ is parametrized, typically through linear projections and non-linearities, and jointly learned with the predictor function. Moreover, it involves multiple maps in the form of compositions of multiple representation layers $\Phi = \Phi_L \circ \dots \circ \Phi_1$.
Metric learning In the general case, the global metric learning problem [17] is learning a distance function $D$ between two points $x, x' \in \mathcal{X}$ as the distance in a new space $\mathcal{F}$:
$$D(x, x') = \|\Phi(x) - \Phi(x')\|_{\mathcal{F}}. \tag{1}$$
The representation $\Phi$ can express linear, kernelized or nonlinear mappings and is obtained by the solution of a regularized, constrained minimization problem, using some form of side information, such as the similarity between pairs [39] or triplets of points [32].
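Triplet loss The large margin nearest neighbor loss [38] was developed for supervised learning of a distance metric by pulling together and pushing apart same- and different-class neighbors, respectively. The closely related contrastive loss [6] uses pairs of observations $(x_i, x_j)$ and their label agreement $y_{ij} \in \{0, 1\}$ (1 for matching labels) to decrease or increase their distance by learning $\Phi$ through:
$$\min_{\Phi} \sum_{i,j=1}^{n} y_{ij}\, \|\Phi(x_i) - \Phi(x_j)\|^2 + (1 - y_{ij})\, \big|\alpha - \|\Phi(x_i) - \Phi(x_j)\|^2\big|_+ \tag{2}$$
where $n$ is the size of the training set $\mathcal{X}_n$, $\alpha$ a distance margin for the non-matching pairs and $|\cdot|_+$ the hinge loss function. The triplet loss is based on defining point triplets $(x_i, x_p, x_q)$ using their label agreement [29], with $x_p$ sharing and $x_q$ not sharing the label of $x_i$, and aims to enforce
$$\|\Phi(x_i) - \Phi(x_p)\|^2 + \alpha \le \|\Phi(x_i) - \Phi(x_q)\|^2 \tag{3}$$
by minimizing the mismatch part of the large margin loss:
$$\min_{\Phi} \sum_{i=1}^{n} \big|\, \|\Phi(x_i) - \Phi(x_p)\|^2 - \|\Phi(x_i) - \Phi(x_q)\|^2 + \alpha \,\big|_+ \tag{4}$$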
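As a minimal sketch of the two losses described above, the following computes the contrastive and triplet hinge terms for already-embedded points. The function names and the use of squared Euclidean distances are assumptions made for illustration, not details confirmed by the text.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def contrastive_loss(z_i, z_j, same_label, margin):
    """Pull matching pairs together; push non-matching pairs at least `margin` apart."""
    d = sq_dist(z_i, z_j)
    return d if same_label else max(0.0, margin - d)


def triplet_loss(z_anchor, z_pos, z_neg, margin):
    """Hinge on d(anchor, pos) - d(anchor, neg) + margin.

    The value is zero once the negative is farther than the positive by `margin`.
    """
    return max(0.0, sq_dist(z_anchor, z_pos) - sq_dist(z_anchor, z_neg) + margin)
```

A triplet whose negative already lies beyond the margin contributes nothing to the sum, so only violating triplets drive the embedding update.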
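Autoencoders An autoencoder is composed of an encoding map $\Phi: \mathcal{X} \to \mathcal{F}$ and a decoding map $\tilde{\Phi}: \mathcal{F} \to \mathcal{X}$, where we assume $\mathcal{X}$ and $\mathcal{F}$ to be Euclidean spaces and $\mathcal{H} \ni \Phi$ and $\tilde{\mathcal{H}} \ni \tilde{\Phi}$ to be the appropriate hypothesis spaces, learned by minimizing a reconstruction loss
$$\min_{\Phi \in \mathcal{H},\, \tilde{\Phi} \in \tilde{\mathcal{H}}} \sum_{i=1}^{n} L\big(x_i, \tilde{\Phi} \circ \Phi(x_i)\big) \tag{5}$$
where $L$ is typically the square or cross-entropy loss. The encoding is parametrized by a linear map using $k$ projection units or filters $W \in \mathbb{R}^{k \times d}$, an offset $b \in \mathbb{R}^{k}$, and a nonlinear function $\sigma$ applied element-wise
$$\Phi(x) = \sigma(Wx + b). \tag{6}$$
The decoding map is typically of a similar form
$$\tilde{\Phi}(z) = \sigma(\tilde{W} z + \tilde{b}), \tag{7}$$
usually constrained to have tied weights, $\tilde{W} = W^{\top}$, for a reduced number of parameters. Both maps can have multiple layers and be learned with additional priors through regularization on $W$ or the activations of the hidden layers [4, 27], reconstructing perturbations of $x$ [35] or convolutional structure on $W$ [40, 23, 42] and pooling.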
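The single-layer, tied-weight autoencoder described above can be sketched as follows. The sigmoid nonlinearity, the squared reconstruction loss and all function names are illustrative assumptions; the text only fixes the general form (linear map, offset, element-wise nonlinearity, decoder reusing the transposed encoder weights).

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def encode(x, W, b):
    """Encoder: element-wise nonlinearity applied to W x + b (W has k rows of length d)."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj) for row, bj in zip(W, b)]


def decode(z, W, b_dec):
    """Decoder with tied weights: uses the transpose of the encoder matrix W."""
    d = len(W[0])
    return [sigmoid(sum(W[j][i] * z[j] for j in range(len(W))) + b_dec[i]) for i in range(d)]


def reconstruction_loss(x, W, b, b_dec):
    """Squared error between the input and its encode-decode reconstruction."""
    x_hat = decode(encode(x, W, b), W, b_dec)
    return sum((a - c) ** 2 for a, c in zip(x, x_hat))
```

Tying the weights means only `W`, `b` and the decoder offset `b_dec` are free parameters, roughly halving the parameter count relative to an untied decoder.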
Transformations and orbit sets Consider a family of transformations $G$ as a set of maps $g: \mathcal{X} \to \mathcal{X}$. We will denote by $gx$ the action of the transformation represented by $g \in G$ on $x \in \mathcal{X}$, which generates point $x'$, i.e. $x' = gx$. The transformations can be parametrized by $\theta$, such that $g = g_\theta$. The set $G$ can have algebraic structure, e.g. form a group [2, 22, 7] (Fig. 2, row 1).
Definition 1 (Group orbits [2]).
An orbit $O_x$ associated to an element $x \in \mathcal{X}$ is the set of points that can be reached under the transformations $G$, i.e., $O_x = \{gx \in \mathcal{X} \mid g \in G\}$.
Given a group structure on $G$, the transformations partition the input space into orbits by defining equivalence relations: $x \sim x' \Leftrightarrow \exists\, g \in G: x' = gx$. As a result, each $x$ belongs to one and only one orbit and the input space is $\mathcal{X} = \bigcup_{x} O_x$. Using the fact that orbits are sets defined by equivalence relations in $\mathcal{X}$, we can extend the definition to relations or set memberships provided by categorical labels.
Definition 2 (Generic orbits).
An orbit $O_x$ associated to an element $x \in \mathcal{X}$ is the subset of $\mathcal{X}$ that includes $x$ along with an equivalence relation, i.e. the equivalence class $O_x = \{x' \in \mathcal{X} \mid x' \sim x\}$. The equivalence relation is given by a function $c: \mathcal{X} \to \mathcal{C}$ such that $x \sim x' \Leftrightarrow c(x) = c(x')$.
Examples of such maps are the labels of a supervised learning task, the indexes of vector quantization codewords or, for the case of sequential data such as videos, the sequence membership, with $\mathcal{C}$ the set of classes, codewords or sequences respectively (Fig. 2, row 2).
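To make the generic-orbit definition concrete, here is a small sketch that partitions a dataset into orbits using a membership function. The function `partition_into_orbits` and the `(sequence_id, frame_index)` record layout are hypothetical, chosen only to illustrate the sequence-membership example.

```python
from collections import defaultdict


def partition_into_orbits(points, key):
    """Group points into equivalence classes: two points share an orbit
    exactly when `key` maps them to the same value."""
    orbits = defaultdict(list)
    for p in points:
        orbits[key(p)].append(p)
    return dict(orbits)


# Hypothetical video frames stored as (sequence_id, frame_index) records.
frames = [("vid_a", 0), ("vid_a", 1), ("vid_b", 0)]
orbits = partition_into_orbits(frames, key=lambda f: f[0])
```

Because `key` is a function, every point lands in exactly one orbit, which is the partition property the definition relies on. Swapping in a class label or a codeword index as the key gives the other two examples mentioned in the text.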
Surrogate classes and exemplar loss The Exemplar loss, introduced as a way to combine data augmentation and weak supervision for training convolutional networks [10], uses a surrogate class for each point in an unlabeled training set $\mathcal{X}_n = \{x_i\}_{i=1}^{n}$. The surrogate class instances are generated by random transformations $g_j$, sampled from $G$, of the class prototypes $x_i$. An embedding $\Phi$ is learned by minimizing a discriminative loss $L$ with respect to the surrogate classes:
$$\min_{\Phi, f} \sum_{i=1}^{n} \sum_{j} L\big(i, f \circ \Phi(g_j x_i)\big) \tag{8}$$
where $i$ indexes the original, untransformed training set and serves as the surrogate class label for all points generated from $x_i$; $f$ is a classifier learnt jointly with the embedding.
Spatial transformer networks (STNs) When a plausible forward model of the process that generates orbits is known, i.e. when a suitable parametrization of $G$ is available, STNs [16] are trainable modules that learn to undo a transformation in $G$ by explicitly transforming the input of a feature map. STNs introduce a specific modification to the parametrization of $\Phi$ that, for an input $g_\theta x$, provides an estimate of $\theta$ and applies the inverse transformation $g_\theta^{-1}$. This module acts as an oracle that provides a rectified, untransformed version of its input, which is then passed to downstream embedding maps. The resulting embedding is robust, by construction, to transformations in $G$.
4 Metric learning with orbit loss
We introduce a novel loss function for learning an embedding for a distance metric using as weak supervision the set memberships on transformation orbits. The loss aims to jointly, adaptively and in a data-driven manner enforce invariance, to the transformations captured by the orbit sets, selectivity and low rectification error on the representation. The loss function is independent of the embedding parametrization, though it implies a siamese (tied weights) and an encoder-decoder network architecture (Fig. 1). In this paper we will learn deep, convolutional encodings. Orbit sets are obtained either from explicit transformations of the unlabeled input data, e.g. each sample generates an orbit, or from a weak supervision signal involving data continuity, e.g. subsets of the training set correspond to data collected sequentially or under multiple views.
4.1 Problem statement
Let training set $\mathcal{X}_n = \{x_i\}_{i=1}^{n} \subset \mathcal{X}$ be a set of unlabeled instances. We assume the input to be in $\mathcal{X} = \mathbb{R}^d$, for example having $x$ be the vectorized intensity values of an image. We aim to learn a feature map or embedding $\Phi: \mathcal{X} \to \mathcal{F}$, in space $\mathcal{F} \subseteq \mathbb{R}^k$, such that the metric given by the distance
$$D(x, x') = \|\Phi(x) - \Phi(x')\|_{\mathcal{F}}, \tag{9}$$
where $\|\cdot\|_{\mathcal{F}}$ denotes the norm in $\mathcal{F}$, is invariant and selective with respect to the transformations of $x$ captured by an orbit set $O_x$, equivalently written as:
$$x \sim x' \Leftrightarrow D(x, x') \le \epsilon, \tag{10}$$
where we relax the exact invariance condition using an $\epsilon$-approximation to the zero-norm distance. In terms of $\Phi$, Eq. (10) defines a sufficient and necessary condition for two points being equivalent under the transformation in the orbit set [2]. Note that the requirement for selectivity, i.e., the converse direction, makes $\mathcal{F}$ a proper metric space with an invariant, in this $\epsilon$-approximation sense, metric.
4.2 Orbit sets
The definition of the orbit sets is crucial for the proposed framework and can be based on the data distribution and the learning problem; here we give a few examples of orbit sets.
Augmentation Given a parametrized family of transformations $G = \{g_\theta\}$, one can generate orbit samples for a given $x$ by randomly sampling from the parameter vectors $\theta$ and letting $x' = g_\theta x$. Examples include geometric transformations (rotation, translation, scaling), e.g. Fig. 2 (row 1), or typical data augmentation transforms (cropping, contrast, color, blur, illumination etc.) [10].
Acquisition If the data acquisition process is part of the learning problem, e.g. online/unsupervised learning, or included as meta-data, e.g. multiple samples of an object across time, conditions or views, e.g. Fig 2 (row 2), then an orbit can be associated to all samples from the same sequence or session [11].
Temporal continuity For sequential data such as videos, an orbit can be a continuous segment of the video stream, following plausible assumptions on feature smoothness and continuity of the representation in time [37, 43].
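The augmentation case above can be sketched by sampling affine parameters and applying them to a sample. For brevity the sample here is a list of 2D points standing in for an image, and every parameter range, default value and function name below is a hypothetical choice, not a value from the paper.

```python
import math
import random


def sample_affine(rng, max_angle=math.pi / 6, max_shear=0.2,
                  scale_range=(0.8, 1.2), max_shift=2.0):
    """Draw one random affine map (rotation, shear, scale, translation).

    All ranges are illustrative placeholders.
    """
    theta = rng.uniform(-max_angle, max_angle)
    shear = rng.uniform(-max_shear, max_shear)
    s = rng.uniform(*scale_range)
    tx = rng.uniform(-max_shift, max_shift)
    ty = rng.uniform(-max_shift, max_shift)
    c, si = math.cos(theta), math.sin(theta)
    # Linear part A = s * R(theta) @ [[1, shear], [0, 1]]
    return (s * c, s * (c * shear - si), s * si, s * (si * shear + c), tx, ty)


def apply_affine(params, points):
    a11, a12, a21, a22, tx, ty = params
    return [(a11 * x + a12 * y + tx, a21 * x + a22 * y + ty) for x, y in points]


def make_orbit(points, n_transforms, rng):
    """Orbit = the untransformed sample followed by n random transformations of it."""
    return [list(points)] + [apply_affine(sample_affine(rng), points)
                             for _ in range(n_transforms)]


orbit = make_orbit([(0.0, 0.0), (1.0, 0.0)], 32, random.Random(0))
```

Keeping the untransformed sample as the first orbit element is a convenience of this sketch; it makes the reference element easy to retrieve later.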
4.3 Orbit metric loss
Assume the set of orbits given, either via an a priori partition of the training set $\mathcal{X}_n$ in a number of equivalence classes $O_c$ such that $\mathcal{X}_n = \bigcup_c O_c$, or by augmentation of each $x_i \in \mathcal{X}_n$ such that $O_{x_i} = \{g x_i \mid g \in G\}$. Given the orbits, consider a set of triplets
$$\mathcal{T} = \{(x_i, x_p, x_q)\} \tag{11}$$
such that each $x_i$ is assigned a positive example (in-orbit), i.e. $x_p \in O_{x_i}$, and a negative example (out-of-orbit), i.e. $x_q \notin O_{x_i}$. We further assume that each orbit is equipped with a canonical example $x_c$.
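A minimal sketch of forming such triplets from known orbit memberships is given below. It draws positives and negatives uniformly at random, which is a simplification swapped in for illustration and not the selection scheme used in the experiments; the function name and data layout are hypothetical.

```python
import random


def sample_triplets(orbits, rng):
    """Build one (anchor, positive, negative) triplet per training point.

    `orbits` maps an orbit id to its list of points. Assumes at least two
    orbits and at least two points per orbit.
    """
    ids = list(orbits)
    triplets = []
    for oid in ids:
        members = orbits[oid]
        for i, anchor in enumerate(members):
            # Positive: a different point from the same orbit.
            p = rng.choice([j for j in range(len(members)) if j != i])
            # Negative: any point from a different orbit.
            neg_id = rng.choice([o for o in ids if o != oid])
            triplets.append((anchor, members[p], rng.choice(orbits[neg_id])))
    return triplets


toy_orbits = {"a": [1, 2, 3], "b": [10, 11]}
triplets = sample_triplets(toy_orbits, random.Random(0))
```

Only orbit membership is consulted, so no semantic class label is needed to build the training triplets.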
Definition 3 (Orbit canonical element).
An orbit point $x_c \in O_x$ that provides a reference coordinate system for the family of transformations that generate the orbit. For $O_x$ obtained through a generative process applied on $x$, $x_c$ is the output of the identity transformation $e \in G$, i.e. $x_c = ex = x$. For orbits from categorical meta-data, $x_c$ is empirically chosen to be the ‘regular’ view or neutral condition (Fig. 3, row 1).
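The proposed loss function, reflected in the architecture in Fig. 1, is composed of two terms: a discriminative term $L_t$, based on the triplet loss, using distances between the encodings on the feature space $\mathcal{F}$; and a reconstruction error $L_e$ between a decoder output and the canonical, as a distance on the input space $\mathcal{X}$:
$$\min_{\Phi, \tilde{\Phi}} \sum_{(x_i, x_p, x_q) \in \mathcal{T}} \lambda_1 \underbrace{\big|\, \|\Phi(x_i) - \Phi(x_p)\|^2 - \|\Phi(x_i) - \Phi(x_q)\|^2 + \alpha \,\big|_+}_{L_t} + \lambda_2 \underbrace{\|x_c - \tilde{\Phi} \circ \Phi(x_i)\|^2}_{L_e} \tag{12}$$
The constants $\lambda_1$, $\lambda_2$ and $\alpha$ control the relative contribution of each term and the distance margin. The loss is independent of the parameterization of $\Phi$ but depends on the selection of the triplets for $x_i$, given the orbits, and the canonical instance for each orbit.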
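The two-term loss described above can be sketched for a single triplet as follows. The argument names (`lam1`, `lam2`, `margin`) and the use of squared Euclidean distances in both terms are assumptions for illustration; embeddings and the decoder output are passed in precomputed so the sketch stays independent of any network parametrization.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def orbit_metric_loss(z_i, z_p, z_q, x_canonical, x_rectified,
                      lam1, lam2, margin):
    """Weighted sum of a triplet hinge term (feature space) and a
    rectification error (input space) for one triplet.

    z_i, z_p, z_q : embeddings of anchor, in-orbit and out-of-orbit points
    x_canonical   : canonical element of the anchor's orbit
    x_rectified   : decoder output for the anchor
    """
    triplet_term = max(0.0, sq_dist(z_i, z_p) - sq_dist(z_i, z_q) + margin)
    rectify_term = sq_dist(x_canonical, x_rectified)
    return lam1 * triplet_term + lam2 * rectify_term
```

Setting `lam2 = 0` leaves only the discriminative term and setting `lam1 = 0` leaves only the rectification term, which are the two special cases studied separately.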
4.4 Orbit triplet loss
For $\lambda_2 = 0$, i.e. without the reconstruction term, the orbit metric loss reduces to the triplet loss in Eq. (4), when similarity and dissimilarity are specified by orbit memberships. Points that lie on the same orbit are pulled together and points on different orbits are pushed apart. The minimizer will be pushed to satisfy Eq. (3), using all triplets in the training set. Note that in the theoretical minimum of the triplet term $L_t$, e.g. using the subgradient of the hinge loss, Eq. (3) is satisfied. The orbit triplet loss follows a Siamese network architecture [6], with a tied-weight embedding trained using triplets as input.
The following proposition shows how, for the case of bounded-norm embeddings, minimizing the triplet loss, thus pushing to minimize Eq. (3), leads to an operational definition of selective robustness to transformations (invariance) with a tolerance margin .
Proposition 1.
Let be in the space of functions with norm bounded by , i.e. . If Eq. (3) is true, then is invariant for the orbit transformations and selective for the orbit identities, according to the -approximate definition in Eq. (10) with .
Proof.
For a triplet , with being positive and negative examples for , Eq. (3) gives and , as by the bounded norm assumption . Since are same and different orbit elements, it holds that , i.e. satisfies (10) with . ∎
4.5 Orbit encoder
For $\lambda_1 = 0$, i.e. without the discriminative term, the orbit metric loss reduces to a loss that penalizes, using an additional decoder map $\tilde{\Phi}: \mathcal{F} \to \mathcal{X}$, the reconstruction error between the decoder output for a point $x_i$ and the canonical $x_c$ of the orbit $O_{x_i}$. This is also the error of the transformation rectification that $\tilde{\Phi} \circ \Phi$ applies on the input $x_i$, assumed to be the transformation of $x_c$. This loss is a novel type of autoencoder loss that learns to de-transform the input.
The motivation is the generalization of denoising autoencoders, which learn to reconstruct clean versions of their noisy input, to transformations with or without an explicit generative model, using the equivalence of points within an orbit. Orbit encoders learn to de-transform an input adaptively, for all transformations in the training set orbits, by mapping points onto a pre-selected canonical orbit element (Fig. 3, top row). This provides a reference point for the set, such that every point in $O_{x_c}$ can be seen as $x_i = g x_c$ or $x_c = g^{-1} x_i$ for a known transformation process. The reconstruction error is then $\|g^{-1} x_i - \tilde{\Phi} \circ \Phi(x_i)\|^2$ and the minimization pushes the solution towards an ‘inversion’ of the transformation, $\tilde{\Phi} \circ \Phi \approx g^{-1}$, jointly for all points in the training set. Another way to see the rectification objective is as trying to reconstruct any given $x_c$ from an artificially transformed version of it, $g x_c$.
Training requires pairs $(x_i, x_c)$ and the choice of the canonical for each orbit has to be consistent only across the same semantic class of a downstream task, e.g. all orbits of the same class. The loss enforces selectivity on $\Phi$ by preserving sufficient information to reconstruct the input irrespective of the transformation.
4.6 Parameterization of the embedding
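The mapping $\Phi$ is parameterized through multiple layers
$$\Phi = \Phi_L \circ \Phi_{L-1} \circ \dots \circ \Phi_1, \tag{13}$$
with each one being a $k_l$-dimensional feature map of linear projections on $k_l$ filters and nonlinearities of the form in Eq. (6). The output $z_l$ of layer $l$ given the output $z_{l-1}$ of layer $l-1$ (with $z_0 = x$) is
$$z_l = \Phi_l(z_{l-1}) = \sigma\big(W_l z_{l-1} + b_l\big), \tag{14}$$
where the weight matrix $W_l \in \mathbb{R}^{k_l \times k_{l-1}}$ and $b_l \in \mathbb{R}^{k_l}$.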
where the weight matrix and . We consider convolutional maps, where groups of filters correspond to the same local kernel shifted over the support of the input, i.e. each filter is sparse on the input (local connectivity) and the projection is a convolution operator (weight sharing). The activation in Eq. (14) is then
[TABLE]
where denotes convolution with each row of (with , where the number of shifts of the convolution kernel) and (one bias per channel).
For the nonlinearity we use the hard rectifier (ReLU activation functions) given by $\sigma(z) = \max(0, z)$, which is applied element wise on the pre-activation output $z = Wx + b$, i.e. $\sigma(z)_j = \max(0, z_j)$. Batch normalization is applied on $z$ before $\sigma$ as $\gamma \odot \hat{z} + \beta$, where $\gamma$ and $\beta$ are trainable parameter vectors and each dimension of $\hat{z}$ is standardized to be zero mean and unit variance, using the statistics of the training mini-batch [15].
In addition, max pooling nonlinearities are introduced after a number of convolution layers in order to increase spatial invariance and decrease the feature map sizes. For each filter , the layer is looking at the corresponding support in its input and takes the maximum over sets of convolution values defined on a grid of neighboring values, i.e. .
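The per-layer operations described above (batch normalization, rectification and max pooling) can be sketched in plain Python. The pooling window of 2 with stride 2, the `eps` constant and the function names are assumptions made for this illustration.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Standardize each dimension over the mini-batch, then scale and shift."""
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(d)]
    return [[gamma[j] * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta[j]
             for j in range(d)] for row in batch]


def relu(v):
    """Hard rectifier applied element-wise."""
    return [max(0.0, a) for a in v]


def max_pool(fmap, size=2, stride=2):
    """Maximum over size-by-size windows of a 2D feature map."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[r + dr][c + dc] for dr in range(size) for dc in range(size))
             for c in range(0, w - size + 1, stride)]
            for r in range(0, h - size + 1, stride)]
```

With a window and stride of 2, each pooling layer halves both spatial dimensions of the feature map, which is what lets later layers cover a larger part of the input.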
The decoder is a series of deconvolution and un-pooling layers [40], in direct correspondence to the encoder in number of layers, units per layer, filters, size of kernels, and with tied weights such that $\tilde{W}_l = W_l^{\top}$.
5 Experiments
We compared the embeddings learned using the proposed loss, Orbit Joint (OJ) in Eq. (12), and its two special cases, Orbit Triplet (OT) (reconstruction weight $\lambda_2 = 0$) and Orbit Encode (OE) (triplet weight $\lambda_1 = 0$), with three closely related reference methods: Supervised Triplet (ST) [29], Exemplar (EX) [10], and standard Autoencoder (AE). Each loss was used for learning a map $\Phi$ from the input to a metric space using the same network/parametrization and varying degrees of supervision (unsupervised, supervised or weakly-supervised using the set of orbits). Once the embeddings were learned, the training and test sets for the downstream tasks (one-shot digit classification on affine MNIST, face verification and retrieval on Multi-PIE) were encoded and used to evaluate the performance. The embedding set used for training the networks, i.e. the collection of orbits, was kept separate from any data used in the downstream tasks. For the affine-MNIST evaluations, we also compared our methods to an embedding parametrization featuring a Spatial Transformer Network module [16], trained with orbit supervision (OT-STN) or full supervision (ST-STN).
5.1 Network and training details
The encoder was a deep convolutional network following the VGG architecture [30]. Each layer was a series of convolutions with a small kernel (of stride 1, padding 1), batch normalization [15] and ReLU activation. A spatial max pooling layer (of stride 2 and size either or ) was used every two such layers of convolutions. The number of channels doubled after each max pooling layer, ranging from 16 to 128 for MNIST and 64 to 512 for Multi-PIE. Four iterations of convolution and pooling were followed by a final fully-connected layer of size 1024. The decoder was a deconvolutional network, reversing the series of operations in the encoder using convolutional reconstruction and max unpooling [40]. Encoder and decoder weights were tied with free biases. Training was done with minibatch Stochastic Gradient Descent using the Adam optimizer. For MNIST experiments we used minibatches of 256, and for Multi-PIE, 72. The selection of the triplets followed the soft negative selection process from [29]. The values for the term weights $\lambda_1$ and $\lambda_2$ were set equal in these experiments, but they can be selected by cross-validation. The STN modules, used for the affine-MNIST comparisons, consisted of two max pooling-convolution-ReLU blocks with 20 filters of size (stride 1), pooling regions of size and no overlap, followed by two linear layers.
5.2 Affine transformations: MNIST
We created a version of MNIST, using 32 random affine transformations for each point in the original MNIST dataset (samples in Fig. 3, middle row). Transformations were sampled uniformly from the union of the following intervals: rotation from , shearing factor from , scale factor from , and translation in each dimension from pixels. The orbit set consisted of the original MNIST training set ( images), augmented by 32 transformations for each sample, resulting in a total of images, grouped in orbits. Each orbit, of size 33, is the set of a single original image (canonical) and the corresponding random transformations of it.
The learned embeddings using this set were employed in a one-shot classification task to assess their invariance and selectivity properties. The training set consisted of 10 images, one from each semantic class. These were drawn at random from the original MNIST validation (augmented by 32 random affine transformations). The test set consisted of images randomly drawn from the original MNIST test set (plus transformations). Figure 4 shows the 2D t-SNE plots [34] of the learned embeddings on a random subset of the test set. Qualitatively, the best separation and grouping was observed with the fully supervised triplet loss (ST), followed by the weakly supervised orbit joint loss (OJ).
Nearest Neighbor classification was used for predicting the label of each test point from the 10-image labelled training set, which is not controlled for transformations. Figure 5 shows classification accuracy results during embedding training epochs. At each iteration, the accuracy is shown as mean with standard deviation (std) error bars over 100 different labelled set selections. As expected, the supervised ST performed best and the unsupervised AE was the lower baseline. Of the weakly-supervised methods, the orbit metric loss OJ achieved the top accuracy, followed by OE and EX. Spatial transformer network modules provided a small improvement in accuracy (, consistent with the improvement reported in [16]) when used with full supervision (ST vs. ST-STN). However, when only orbit information was available (OT vs. OT-STN), there was no difference in performance. This is further reflected in Fig. 6 which shows rectification examples from the output of the STN module and the learned decoder with our method.
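The one-shot evaluation described above amounts to 1-nearest-neighbor classification in the embedding space with a single labelled example per class. The sketch below assumes squared Euclidean distance and uses hypothetical function names; embeddings are passed in precomputed.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_neighbor_label(query, support):
    """Label of the closest labelled embedding; `support` is a list of (embedding, label)."""
    return min(support, key=lambda s: sq_dist(query, s[0]))[1]


def one_shot_accuracy(test, support):
    """Fraction of (embedding, label) test pairs whose nearest support label is correct."""
    correct = sum(nearest_neighbor_label(z, y_support_pairs := support) == y
                  for z, y in test)
    return correct / len(test)


# Toy 2D embeddings: one labelled example for each of two classes.
support = [([0.0, 0.0], "0"), ([5.0, 5.0], "1")]
test = [([0.2, -0.1], "0"), ([4.0, 6.0], "1"), ([3.0, 3.0], "0")]
acc = one_shot_accuracy(test, support)
```

Repeating this over many random choices of the labelled set, as the text does with 100 selections, gives the mean and spread of the accuracy.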
5.3 Face transformations: Multi-PIE
The Multi-PIE dataset [11] contains images of faces of 129 individuals, captured from 13 distinct viewpoints and under 20 different illumination conditions. Acquisition was carried out across four sessions, resulting in a dataset of images. For learning the maps from the input to the metric space, we used all images from three of the sessions to form the embedding sets and left out all images from the fourth session for performance assessment in the downstream task. During training of the embedding map , the ST method had access to the face identity for each image, thus considering all images of the same subject (across sessions, viewpoints and illumination conditions) as belonging to the same equivalence class set. The weakly supervised methods (OJ, OT, OE, EX) on the other hand, had only access to the set of orbits formed by partitioning the embedding set in orbits, each corresponding to all 13 viewpoints for a single identity, illumination condition and session (Fig. 2, row 2).
For the purpose of performance assessment, we used the learned maps to encode the held-out test set (one session). Figure 7 shows the relative distance landscape of the learned embeddings, as 2D t-SNE plots, for all images from 10 subjects of the test set. The weakly supervised OJ appears to have similar or better grouping and separability properties than the fully supervised ST. We used two distance-based tasks to quantitatively evaluate the embedding metric spaces: a same-different face verification task and a face retrieval task. In a transformation-robust metric space for face representation, same-identity images should be closer to each other than to images of other identities, and the nearest neighbor of each image should be an image of the same class. We measure the Area Under the ROC Curve (AUC) for verification and the mean top-1 precision for retrieval. The process of training and evaluating an embedding was repeated on all four possible 3-1 splits across sessions of Multi-PIE, to assess the uncertainty in the performance measures.
For the verification task, we used all unique pairwise distances in the embedding space, considered all possible decision thresholds and integrated the True Positive and False Positive rates to compute the AUC. For the retrieval task, we selected the closest point to a query image (top-1 retrieval) from a target search set. We considered each test image individually as the query, using the rest of the test set as the target set, after removing all same-identity images (32 in total, including the query) at the same illumination (regardless of viewpoint) and at the same viewpoint (regardless of illumination). This made for a more challenging task and helped ensure that the embeddings were evaluated with respect to their preference for identity over appearance, e.g. by excluding candidates with strong pose or illumination bias. As a performance measure, we report the mean precision, i.e. the fraction of queries that yielded a correct retrieval.
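Both measures can be sketched in a few lines of NumPy. The following is an illustration under assumptions, not the authors' implementation: distances are taken to be Euclidean, the AUC is computed through its rank interpretation (the probability that a same-identity pair is closer than a different-identity pair, which equals the area obtained by sweeping all thresholds), and all function names are hypothetical. The pairwise comparison is quadratic in the number of pairs, so it suits small sets only.

```python
import numpy as np

def verification_auc(emb, ids):
    """AUC of same/different verification over all unique pairs."""
    i, j = np.triu_indices(len(emb), k=1)
    dist = np.linalg.norm(emb[i] - emb[j], axis=1)
    same = ids[i] == ids[j]
    pos, neg = dist[same], dist[~same]
    # Probability that a positive pair is closer than a negative pair;
    # ties count one half.
    closer = (pos[:, None] < neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(closer + 0.5 * ties)

def exclusion_mask(ids, illum, view):
    """True where a target must be removed for a query: same identity and
    either the same illumination or the same viewpoint (includes the query)."""
    same_id = ids[:, None] == ids[None, :]
    same_cond = (illum[:, None] == illum[None, :]) | (view[:, None] == view[None, :])
    return same_id & same_cond

def top1_precision(emb, ids, illum, view):
    """Fraction of queries whose nearest allowed neighbor shares their identity."""
    dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    dist = np.where(exclusion_mask(ids, illum, view), np.inf, dist)
    nearest = np.argmin(dist, axis=1)
    return float((ids[nearest] == ids).mean())
```

With 13 viewpoints and 20 illuminations in a single test session, the mask removes 13 + 20 - 1 = 32 same-identity images per query, which matches the count given in the text.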
Verification performance is shown in Fig. 9 as the mean and s.d. of the AUC across 3-1 splits of Multi-PIE sessions (three for embedding training, one for evaluation). As expected, the weakly supervised methods, which access only the orbit assignments during training, fall between the ST loss, which has access to category-level labels, and the unsupervised AE loss. Orbit triplet (OT) learns very quickly, and its performance tends to decrease after a few iterations. The other methods learn at comparable rates. The joint orbit loss (OJ) achieves the best AUC score. A similar ranking holds for the retrieval task, shown as top-1 precision in Fig. 10.
5.4 Early stopping by cross validation
To evaluate the generalization performance, i.e. testing on a set of unseen examples, we trained all embeddings using cross-validation to select the number of iterations individually for each. We compared the proposed OJ to the state-of-the-art, weakly supervised EX loss using six splits (4 choose 2) by session on Multi-PIE (2-Embedding – 1-Validation (VA) – 1-Test (TE)) and 10 random splits of the MNIST test set (VA – TE), with elements of the same orbit appearing in only one of the two sets. We selected the stopping time that gave the best performance measure (AUC, top-1 precision, accuracy) on the VA set and evaluated the same measure on the TE set. Table 1 shows the mean and s.d. over splits, with corresponding p-values (paired t-test with Bonferroni correction) quantifying the significance of the differences OJ-EX, OJ-OT and OJ-OE. OJ consistently outperforms EX in all three generalization tasks. Moreover, the performance of OJ is either better than or statistically indistinguishable (at a standard significance threshold) from that of OT and OE. This observation makes the case for the joint loss, which can result in substantial improvements, as in the one-shot classification task on affine MNIST. Furthermore, it suggests that careful selection of the relative weights of the triplet and reconstruction terms in the OJ loss (Eq. 12), e.g. via cross-validation, could be beneficial.
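The stopping rule can be sketched as follows, assuming that the performance measure has been recorded on both the VA and TE sets at the same checkpoints for every split. The function name `early_stopped_scores` and the numbers in the example are illustrative only and are not results from the paper; the significance test is not reproduced here.

```python
import numpy as np

def early_stopped_scores(val_curves, test_curves):
    """For each split, pick the checkpoint with the best validation score
    and report the test score at that same checkpoint.

    Both inputs have shape (n_splits, n_checkpoints); higher is better.
    """
    val_curves = np.asarray(val_curves, dtype=float)
    test_curves = np.asarray(test_curves, dtype=float)
    best = np.argmax(val_curves, axis=1)            # stopping time per split
    selected = test_curves[np.arange(len(best)), best]
    return best, selected

# Illustrative curves for two splits and three checkpoints (made-up values).
val = [[0.5, 0.9, 0.7], [0.6, 0.7, 0.8]]
test = [[0.4, 0.8, 0.9], [0.5, 0.6, 0.7]]
best, selected = early_stopped_scores(val, test)
mean_score, sd_score = selected.mean(), selected.std()
```

Selecting the checkpoint on VA rather than TE keeps the reported score an estimate of performance on unseen data, since the test curve is read at a single, pre-selected index.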
6 Conclusions
We introduced a loss function that combines a discriminative and a generative term for learning embeddings, using weak supervision from generic transformation orbits. We showed that the resulting image embeddings induce a metric space that is relevant for distance-based learning tasks such as one-shot classification, face verification and retrieval. The two loss terms serve complementary purposes, so that joint training is advantageous and outperforms state-of-the-art, exemplar-based training and, when applicable, Spatial Transformer Networks. Transformations that do not alter the semantic category of the input are present in most classical perception problems, from pitch shifts in speech recognition, to pose, illumination and gait changes in action recognition, to reflectance properties in object categorization. The work presented here suggests that explicitly defining equivalence classes according to these transformations provides a rich, weak supervision signal that can be exploited in a more general class of representation learning methods, starting from the proposed loss function, to learn semantically relevant embeddings. Such embeddings define distance functions that are robust to typical transformations and are useful for categorization, retrieval, verification and clustering. Future work should assess the relevance for other modalities, such as video or audio, and the potential of acquiring the equivalence classes through time continuity.
Acknowledgements
This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. The DGX-1 used for this research was donated by the NVIDIA Corporation. Stephen Voinea acknowledges the support of a Nuance Foundation Grant. The authors gratefully acknowledge Tomaso Poggio and Fabio Anselmi for insightful discussions.
Appendix A: Decoder output examples
Figure 3 and Fig. 8 showed examples of rectifications applied to transformed images, each obtained from a seed, untransformed image through a latent transformation, using the output of the decoder. While for OE this rectification is the sole criterion that drives the learned embedding, for OJ it competes with the triplet loss term. This results in different learned features and different image outputs from the decoder.
The effect of joint training on the visual appearance of the rectified outputs is shown for ten images from the affine MNIST dataset in Fig. 11 and ten images from Multi-PIE in Fig. 12. Each figure depicts the decoder output for the standard AE (column 3), OE (column 4) and OJ (column 5). The autoencoder output is included as a sanity check of the reconstruction loss, i.e., the output is a faithful reconstruction of the input, which includes the transformation effect. The decoders for the OE and OJ losses both do well at rectifying the transformation (affine for MNIST and 3D viewpoint for Multi-PIE), i.e. mapping the transformed input (column 2) to an image that resembles the untransformed one (column 1). One can note subtle differences in the outputs, e.g., in the 3 and 9 instances in MNIST and particularly in Multi-PIE, though it is not easy to prefer one over the other based on visual quality.
- [1] F. Anselmi, J. Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, and T. Poggio. Unsupervised Learning of Invariant Representations. Theoretical Computer Science, 633:112–121, Jun 2015.
- [2] F. Anselmi, L. Rosasco, and T. Poggio. On Invariance and Selectivity in Representation Learning. Information and Inference, 5(2):134–158, May 2016.
- [3] P. Baldi. Autoencoders, unsupervised learning, and deep architectures. ICML 2012 Unsupervised and Transfer Learning Workshop, 2012.
- [4] Y. Bengio, A. Courville, and P. Vincent. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
- [5] J. Bruna and S. Mallat. Invariant Scattering Convolution Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.
- [6] S. Chopra, R. Hadsell, and Y. Le Cun. Learning a Similarity Metric Discriminatively, with Application to Face Verification. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 539–546, 2005.
- [7] T. S. Cohen and M. Welling. Group Equivariant Convolutional Networks. In International Conference on Machine Learning (ICML), Feb 2016.
- [8] J. J. DiCarlo, D. Zoccolan, and N. C. Rust. How does the brain solve visual object recognition? Neuron, 73(3):415–34, Feb 2012.
