TL;DR
This paper explores the advantages of hyperbolic embeddings over traditional Euclidean and spherical embeddings in various computer vision tasks, showing they often yield better performance in practical scenarios.
Contribution
It introduces the use of hyperbolic embeddings for computer vision tasks, highlighting their benefits over Euclidean and spherical methods.
Findings
Hyperbolic embeddings outperform Euclidean and spherical embeddings in several tasks.
Hyperbolic space better captures hierarchical and complex data structures.
The study demonstrates practical scenarios where hyperbolic embeddings are advantageous.
Table 4: Few-shot classification accuracy on MiniImageNet.
| Baselines | Embedding Net | 1-Shot 5-Way | 5-Shot 5-Way |
| MatchingNet [53] | 4 Conv | 43.56 ± 0.84% | 55.31 ± 0.73% |
| MAML [9] | 4 Conv | 48.70 ± 1.84% | 63.11 ± 0.92% |
| RelationNet [48] | 4 Conv | 50.44 ± 0.82% | 65.32 ± 0.70% |
| REPTILE [28] | 4 Conv | 49.97 ± 0.32% | 65.99 ± 0.58% |
| ProtoNet [43] | 4 Conv | 49.42 ± 0.78% | 68.20 ± 0.66% |
| Baseline* [4] | 4 Conv | 41.08 ± 0.70% | 54.50 ± 0.66% |
| Spot&learn [6] | 4 Conv | 51.03 ± 0.78% | 67.96 ± 0.71% |
| DN4 [23] | 4 Conv | 51.24 ± 0.74% | 71.02 ± 0.64% |
| Hyperbolic ProtoNet | 4 Conv | 54.43 ± 0.20% | 72.67 ± 0.15% |
| SNAIL [27] | ResNet12 | 55.71 ± 0.99% | 68.88 ± 0.92% |
| ProtoNet+ [43] | ResNet12 | 56.50 ± 0.40% | 74.2 ± 0.20% |
| CAML [16] | ResNet12 | 59.23 ± 0.99% | 72.35 ± 0.71% |
| TPN [25] | ResNet12 | 59.46% | 75.65% |
| MTL [47] | ResNet12 | 61.20 ± 1.8% | 75.50 ± 0.8% |
| DN4 [23] | ResNet12 | 54.37 ± 0.36% | 74.44 ± 0.29% |
| TADAM [32] | ResNet12 | 58.50% | 76.70% |
| Qiao-WRN [34] | Wide-ResNet28 | 59.60 ± 0.41% | 73.74 ± 0.19% |
| LEO [38] | Wide-ResNet28 | 61.76 ± 0.08% | 77.59 ± 0.12% |
| Dis. k-shot [2] | ResNet34 | 56.30 ± 0.40% | 73.90 ± 0.30% |
| Self-Jig(SVM) [5] | ResNet50 | 58.80 ± 1.36% | 76.71 ± 0.72% |
| Hyperbolic ProtoNet | ResNet18 | 59.47 ± 0.20% | 76.84 ± 0.14% |
Table 5: Few-shot classification accuracy on Caltech-UCSD Birds (CUB).
| Baselines | Embedding Net | 1-Shot 5-Way | 5-Shot 5-Way |
| MatchingNet [53] | 4 Conv | 61.16 ± 0.89% | 72.86 ± 0.70% |
| MAML [9] | 4 Conv | 55.92 ± 0.95% | 72.09 ± 0.76% |
| ProtoNet [43] | 4 Conv | 51.31 ± 0.91% | 70.77 ± 0.69% |
| MACO [15] | 4 Conv | 60.76% | 74.96% |
| RelationNet [48] | 4 Conv | 62.45 ± 0.98% | 76.11 ± 0.69% |
| Baseline++ [4] | 4 Conv | 60.53 ± 0.83% | 79.34 ± 0.61% |
| DN4-DA [23] | 4 Conv | 53.15 ± 0.84% | 81.90 ± 0.60% |
| Hyperbolic ProtoNet | 4 Conv | 64.02 ± 0.24% | 82.53 ± 0.14% |
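Table 6: Person re-identification results (Rank-1 accuracy, r1, and mean average precision, mAP).
| | Market-1501 | | | | DukeMTMC-reID | | | |
| | Euclidean | | Hyperbolic | | Euclidean | | Hyperbolic | |
| dim, lr schedule | r1 | mAP | r1 | mAP | r1 | mAP | r1 | mAP |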
| 32, sch#1 | 71.4 | 49.7 | 69.8 | 45.9 | 56.1 | 35.6 | 56.5 | 34.9 |
| 32, sch#2 | 68.0 | 43.4 | 75.9 | 51.9 | 57.2 | 35.7 | 62.2 | 39.1 |
| 64, sch#1 | 80.3 | 60.3 | 83.1 | 60.1 | 69.9 | 48.5 | 70.8 | 48.6 |
| 64, sch#2 | 80.5 | 57.8 | 84.4 | 62.7 | 68.3 | 45.5 | 70.7 | 48.6 |
| 128, sch#1 | 86.0 | 67.3 | 87.8 | 68.4 | 74.1 | 53.3 | 76.5 | 55.4 |
| 128, sch#2 | 86.5 | 68.5 | 86.4 | 66.2 | 71.5 | 51.5 | 74.0 | 52.2 |
Hyperbolic Image Embeddings
Valentin Khrulkov*1,4, Leyla Mirvakhabova*1, Evgeniya Ustinova1, Ivan Oseledets1,2, Victor Lempitsky1,3
1 Skolkovo Institute of Science and Technology (Skoltech), Moscow
2 Institute of Numerical Mathematics of the Russian Academy of Sciences, Moscow
3 Samsung AI Center, Moscow
4 Yandex, Moscow
{valentin.khrulkov,leyla.mirvakhabova,evgeniya.ustinova,i.oseledets,lempitsky}@skoltech.ru
* Equal contribution
Abstract
Computer vision tasks such as image classification, image retrieval, and few-shot learning are currently dominated by Euclidean and spherical embeddings so that the final decisions about class belongings or the degree of similarity are made using linear hyperplanes, Euclidean distances, or spherical geodesic distances (cosine similarity). In this work, we demonstrate that in many practical scenarios, hyperbolic embeddings provide a better alternative.
1 Introduction
Learned high-dimensional embeddings are ubiquitous in modern computer vision. Learning aims to group together semantically-similar images and to separate semantically-different images. When the learning process is successful, simple classifiers can be used to assign an image to classes, and simple distance measures can be used to assess the similarity between images or image fragments. The operations at the end of deep networks imply a certain type of geometry of the embedding spaces. For example, image classification networks [19, 22] use linear operators (matrix multiplication) to map embeddings in the penultimate layer to class logits. The class boundaries in the embedding space are thus piecewise-linear, and pairs of classes are separated by Euclidean hyperplanes. The embeddings learned by the model in the penultimate layer, therefore, live in the Euclidean space. The same can be said about systems where Euclidean distances are used to perform image retrieval [31, 44, 58], face recognition [33, 57] or one-shot learning [43].
Alternatively, some few-shot learning [53], face recognition [41], and person re-identification methods [52, 59] learn spherical embeddings, so that a sphere projection operator is applied at the end of the network that computes the embeddings. Such architectures then use cosine similarity (closely associated with the spherical geodesic distance) to match images.
Euclidean spaces with their zero curvature and spherical spaces with their positive curvature have profound implications for the nature of the embeddings that existing computer vision systems can learn. In this work, we argue that hyperbolic spaces with negative curvature might often be more appropriate for learning embeddings of images. To this end, we add the recently proposed hyperbolic network layers [11] to the end of several computer vision networks and present a number of experiments on image classification, one-shot and few-shot learning, and person re-identification. We show that in many cases, the use of hyperbolic geometry improves performance over Euclidean or spherical embeddings.
Our work is inspired by the recent body of works that demonstrate the advantage of learning hyperbolic embeddings for language entities such as taxonomy entries [29], common words [50], phrases [8] and for other NLP tasks, such as neural machine translation [12]. Our results imply that hyperbolic spaces may be as valuable for improving the performance of computer vision systems.
Motivation for hyperbolic image embeddings.
The use of hyperbolic spaces in natural language processing [29, 50, 8] is motivated by the ubiquity of hierarchies in NLP tasks. Hyperbolic spaces are naturally suited to embed hierarchies (e.g., tree graphs) with low distortion [40, 39]. Here, we argue that hierarchical relations between images are common in computer vision tasks (Figure 2):
- In image retrieval, an overview photograph is related to many images that correspond to the close-ups of different distinct details. Likewise, for classification tasks in-the-wild, an image containing the representatives of multiple classes is related to images that contain representatives of the classes in isolation. Embedding a dataset that contains composite images into continuous space is, therefore, similar to embedding a hierarchy.
- In some tasks, more generic images may correspond to images that contain less information and are therefore more ambiguous. E.g., in face recognition, a blurry and/or low-resolution face image taken from afar can be related to many high-resolution images of faces that clearly belong to distinct people. Again, a natural embedding for image datasets with widely varying image quality/ambiguity calls for retaining such hierarchical structure.
- Many of the natural hierarchies investigated in natural language processing transfer to the visual domain. E.g., the visual concepts of different animal species may be amenable to hierarchical grouping (e.g., most felines share visual similarity while being visually distinct from pinnipeds).
Hierarchical relations between images call for the use of hyperbolic spaces. Indeed, the volume of a hyperbolic space expands exponentially with the radius, which makes such spaces continuous analogues of trees, in contrast to Euclidean spaces, where the expansion is polynomial. It therefore seems plausible that the exponentially expanding hyperbolic space will be able to capture the underlying hierarchy of visual data.
In order to build deep learning models that operate on embeddings in hyperbolic spaces, we capitalize on recent developments [11], which construct the analogues of familiar layers (such as a feed-forward layer or a multinomial regression layer) in hyperbolic spaces. We show that many standard architectures used for image classification, in particular in the few-shot learning setting, can be easily modified to operate on hyperbolic embeddings, which in many cases also improves their performance.
The main contributions of our paper are twofold:
- First, we apply the machinery of hyperbolic neural networks to computer vision tasks. Our experiments with various few-shot learning and person re-identification models and datasets demonstrate that hyperbolic embeddings are beneficial for visual data.
- Second, we propose an approach to evaluate the hyperbolicity of a dataset based on the concept of Gromov $\delta$-hyperbolicity. It further allows estimating the radius of the Poincaré disk for an embedding of a specific dataset and thus can serve as a handy tool for practitioners.
2 Related work
Hyperbolic language embeddings.
Hyperbolic embeddings in the natural language processing field have recently been very successful [29, 30]. They are motivated by the innate ability of hyperbolic spaces to embed hierarchies (e.g., tree graphs) with low distortion [39, 40]. However, due to the discrete nature of data in NLP, such works typically employ Riemannian optimization algorithms in order to learn embeddings of individual words to hyperbolic space. This approach is difficult to extend to visual data, where image representations are typically computed using CNNs.
Another direction of research, more relevant to the present work, is based on imposing hyperbolic structure on activations of neural networks [11, 12]. However, the proposed architectures were mostly evaluated on various NLP tasks, with correspondingly modified traditional models such as RNNs or Transformers. We find that certain computer vision problems that heavily use image embeddings can benefit from such hyperbolic architectures as well. Concretely, we analyze the following tasks.
Few-shot learning.
The task of few-shot learning is concerned with the overall ability of the model to generalize to data unseen during training. Most of the existing state-of-the-art few-shot learning models are based on metric learning approaches, utilizing the distance between image representations computed by deep neural networks as a measure of similarity [53, 43, 48, 28, 4, 6, 23, 2, 38, 5]. In contrast, other models apply meta-learning to few-shot learning: e.g., MAML by [9], Meta-Learner LSTM by [35], SNAIL by [27]. While these methods employ either Euclidean or spherical geometries (as in [53]), there has been no extension to hyperbolic spaces.
Person re-identification.
The task of person re-identification is to match pedestrian images captured by possibly non-overlapping surveillance cameras. Papers [1, 13, 56] adopt pairwise models that accept pairs of images and output their similarity scores. The resulting similarity scores are used to classify the input pairs as matching or non-matching. Another popular direction of work includes approaches that aim at learning a mapping of the pedestrian images to a Euclidean descriptor space. Several papers, e.g., [46, 59], use verification loss functions based on the Euclidean distance or cosine similarity. A number of methods utilize a simple classification approach for training [3, 45, 17, 60], and the Euclidean distance is used at test time.
3 Reminder on hyperbolic spaces and hyperbolicity estimation.
Formally, $n$-dimensional hyperbolic space, denoted $\mathbb{H}^n$, is defined as the homogeneous, simply connected $n$-dimensional Riemannian manifold of constant negative sectional curvature. The property of constant negative curvature makes it analogous to the ordinary Euclidean sphere (which has constant positive curvature); however, the geometrical properties of the hyperbolic space are very different. It is known that hyperbolic space cannot be isometrically embedded into Euclidean space [18, 24], but there exist several well-studied models of hyperbolic geometry. In every model, a certain subset of Euclidean space is endowed with a hyperbolic metric; however, all these models are isomorphic to each other, and we may easily move from one to another depending on which model makes the formulas of interest simpler. We follow the majority of NLP works and use the Poincaré ball model.
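The Poincaré ball model $(\mathbb{D}^n, g^{\mathbb{D}})$ is defined by the manifold $\mathbb{D}^n = \{\mathbf{x} \in \mathbb{R}^n : \|\mathbf{x}\| < 1\}$ endowed with the Riemannian metric $g^{\mathbb{D}}(\mathbf{x}) = \lambda_{\mathbf{x}}^2 g^E$, where $\lambda_{\mathbf{x}} = \frac{2}{1 - \|\mathbf{x}\|^2}$ is the conformal factor and $g^E$ is the Euclidean metric tensor $g^E = \mathbf{I}^n$. In this model the geodesic distance between two points is given by the following expression:
$$d_{\mathbb{D}}(\mathbf{x}, \mathbf{y}) = \operatorname{arccosh}\left(1 + 2\,\frac{\|\mathbf{x} - \mathbf{y}\|^2}{(1 - \|\mathbf{x}\|^2)(1 - \|\mathbf{y}\|^2)}\right). \tag{1}$$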
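A minimal sketch of this distance in plain Python may help make the geometry concrete. It assumes the standard closed form $d(\mathbf{x}, \mathbf{y}) = \operatorname{arccosh}\big(1 + 2\|\mathbf{x}-\mathbf{y}\|^2 / ((1-\|\mathbf{x}\|^2)(1-\|\mathbf{y}\|^2))\big)$ on the unit ball; the function name `poincare_distance` is illustrative and is not taken from the released code.

```python
import math


def poincare_distance(x, y):
    """Geodesic distance between two points of the unit Poincare ball.

    Points are given as sequences of floats with Euclidean norm < 1.
    """
    sq_norm_x = sum(a * a for a in x)
    sq_norm_y = sum(b * b for b in y)
    if sq_norm_x >= 1.0 or sq_norm_y >= 1.0:
        raise ValueError("both points must lie strictly inside the unit ball")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    arg = 1.0 + 2.0 * sq_dist / ((1.0 - sq_norm_x) * (1.0 - sq_norm_y))
    return math.acosh(arg)


origin = [0.0, 0.0]
# The same Euclidean step costs far more hyperbolic distance near the boundary.
print(poincare_distance(origin, [0.1, 0.0]))
print(poincare_distance([0.89, 0.0], [0.99, 0.0]))
```

The two printed values correspond to the same Euclidean displacement of 0.1, taken once near the origin and once near the boundary, which illustrates how distances grow as points approach the edge of the ball.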
In order to define the hyperbolic average, we will make use of the Klein model of hyperbolic space. Similarly to the Poincaré model, it is defined on the set $\mathbb{K}^n = \{\mathbf{x} \in \mathbb{R}^n : \|\mathbf{x}\| < 1\}$, however, with a different metric, not relevant for further discussion. In Klein coordinates, the hyperbolic average (generalizing the usual Euclidean mean) takes its simplest form, and we present the necessary formulas in Section 4.
From the viewpoint of hyperbolic geometry, all points of the Poincaré ball are equivalent. The models that we consider below are, however, hybrid in the sense that most layers use Euclidean operators, such as standard generalized convolutions, while only the final layers operate within the hyperbolic geometry framework. The hybrid nature of our setups makes the origin a special point, since, from the Euclidean viewpoint, the local volumes in the Poincaré ball expand exponentially from the origin to the boundary. This leads to the useful tendency of the learned embeddings to place more generic/ambiguous objects closer to the origin while moving more specific objects towards the boundary. The distance to the origin in our models, therefore, provides a natural estimate of uncertainty that can be used in several ways, as we show below.
The choice of the Poincaré ball model is justified for the following reasons. First, many existing vision architectures are designed to output embeddings in the vicinity of zero (e.g., in the unit ball). Second, an appealing property of the standard Poincaré ball model is the existence of a reference point: the center of the ball. We show that in image classification models that construct embeddings in the Poincaré model of hyperbolic space, the distance to the center can serve as a measure of confidence of the model: the input images that are more familiar to the model get mapped closer to the boundary, and images that confuse the model (e.g., blurry or noisy images, instances of a previously unseen class) are mapped closer to the center. The geometrical properties of hyperbolic spaces are quite different from the properties of the Euclidean space. For instance, the sum of angles of a geodesic triangle is always less than $\pi$. These geometrical properties make it possible to construct a “score” which, for an arbitrary metric space, provides a degree of similarity of this metric space to a hyperbolic space. This score is called $\delta$-hyperbolicity, and we now discuss it in detail.
3.1 $\delta$-Hyperbolicity
Let us start with an illustrative example. The simplest discrete metric space possessing hyperbolic properties is a tree (in the sense of graph theory) endowed with the natural shortest path distance. Note the following property: for any three vertices $a, b, c$, the geodesic triangle (consisting of geodesics, i.e., paths of shortest length connecting each pair) spanned by these vertices (see Figure 4) is slim, which informally means that it has a center (a vertex) that is contained in every side of the triangle. By relaxing this condition to allow for some slack value $\delta$ and considering so-called $\delta$-slim triangles, we arrive at the following general definition.
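Let $X$ be an arbitrary (metric) space endowed with the distance function $d$. Its $\delta$-hyperbolicity value then may be computed as follows. We start with the so-called Gromov product for points $x, y, z \in X$:
$$(y, z)_x = \frac{1}{2}\big(d(x, y) + d(x, z) - d(y, z)\big). \tag{2}$$
Then, $\delta$ is defined as the minimal value such that the following four-point condition holds for all points $x, y, z, w \in X$:
$$(x, z)_w \geq \min\big((x, y)_w, (y, z)_w\big) - \delta. \tag{3}$$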
The definition of hyperbolic space in terms of the Gromov product can be seen as saying that the metric relations between any four points are the same as they would be in a tree, up to the additive constant $\delta$. $\delta$-Hyperbolicity captures the basic common features of “negatively curved” spaces like the classical real-hyperbolic space $\mathbb{H}^n$ and of discrete spaces like trees.
For practical computations, it suffices to find the $\delta$ value for some fixed point $w = w_0$ as it is independent of $w$. An efficient way to compute $\delta$ is presented in [10]. Having a set of points, we first compute the matrix $A$ of pairwise Gromov products using Equation (2). After that, the $\delta$ value is simply the largest coefficient in the matrix $(A \otimes A) - A$, where $\otimes$ denotes the min-max matrix product
$$(A \otimes B)_{ij} = \max_k \min\{A_{ik}, B_{kj}\}. \tag{4}$$
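The procedure can be sketched in a few lines of plain Python. This is an illustrative $O(n^3)$ version, assuming the Gromov product $(y, z)_x = \frac{1}{2}(d(x,y) + d(x,z) - d(y,z))$ with a fixed base point and the min-max product $(A \otimes B)_{ij} = \max_k \min\{A_{ik}, B_{kj}\}$; the function names are hypothetical and a practical implementation would vectorize the loops.

```python
def gromov_products(dist, base=0):
    """Matrix A with A[i][j] = (i, j)_base for a symmetric distance matrix."""
    n = len(dist)
    return [
        [0.5 * (dist[base][i] + dist[base][j] - dist[i][j]) for j in range(n)]
        for i in range(n)
    ]


def min_max_product(a, b):
    """Min-max matrix product: result[i][j] = max_k min(a[i][k], b[k][j])."""
    n = len(a)
    return [
        [max(min(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def delta_hyperbolicity(dist, base=0):
    """Largest entry of (A (x) A) - A, where A holds the Gromov products."""
    a = gromov_products(dist, base)
    prod = min_max_product(a, a)
    n = len(a)
    return max(prod[i][j] - a[i][j] for i in range(n) for j in range(n))


# Shortest-path metric of a star tree: centre 0 with three leaves.
star_tree = [
    [0, 1, 1, 1],
    [1, 0, 2, 2],
    [1, 2, 0, 2],
    [1, 2, 2, 0],
]
# Shortest-path metric of a 4-cycle, which is not tree-like.
four_cycle = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]
print(delta_hyperbolicity(star_tree), delta_hyperbolicity(four_cycle))
```

The tree metric yields a value of zero, as expected for a perfectly hierarchical space, while the cycle yields a strictly positive value.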
Results.
In order to verify our hypothesis on the hyperbolicity of visual datasets, we compute the scale-invariant metric, defined as $\delta_{rel}(X) = \frac{2\delta(X)}{\operatorname{diam}(X)}$, where $\operatorname{diam}(X)$ denotes the set diameter (maximal pairwise distance). By construction, $\delta_{rel}(X) \in [0, 1]$ and specifies how close a dataset is to a hyperbolic space. Due to the computational complexity of Equations (2) and (4), we employ the batched version of the algorithm, simply sampling $N$ points from a dataset and finding the corresponding $\delta_{rel}$. Results are averaged across multiple runs, and we provide the resulting mean and standard deviation. We experiment on a number of toy datasets (such as samples from the standard two-dimensional unit sphere), as well as on a number of popular computer vision datasets. As a natural distance between images, we used the standard Euclidean distance between feature vectors extracted by various CNNs pretrained on the ImageNet (ILSVRC) dataset [7]. Specifically, we consider VGG19 [42], ResNet34 [14] and Inception v3 [49] networks for distance evaluation. While other metrics are possible, we hypothesize that the underlying hierarchical structure (useful for computer vision tasks) of image datasets can be well understood in terms of their deep feature similarity.
Our results are summarized in Table 2. We observe that the degree of hyperbolicity in image datasets is quite high, as the obtained $\delta_{rel}$ are significantly closer to $0$ than to $1$ (which would indicate complete non-hyperbolicity). This observation suggests that visual tasks can benefit from hyperbolic representations of images.
Relation between $\delta$-hyperbolicity and Poincaré disk radius.
It is known [50] that the standard Poincaré ball is $\delta$-hyperbolic with $\delta_P = \log(1 + \sqrt{2}) \approx 0.88$. Formally, the diameter of the Poincaré ball is infinite, which yields a $\delta_{rel}$ value of $0$. However, from a computational point of view we cannot approach the boundary infinitely closely. Thus, we can compute the effective value of $\delta_{rel}$ for the Poincaré ball. For the clipping value of $10^{-5}$, i.e., when we consider only the subset of points with the (Euclidean) norm not exceeding $1 - 10^{-5}$, the resulting diameter is equal to $\sim 12.204$. This provides the effective $\delta_{rel} \approx 0.144$. Using this constant we can estimate the radius of the Poincaré disk suitable for an embedding of a specific dataset. Suppose that for some dataset $X$ we have found that its $\delta_{rel}$ is equal to $\delta_X$. Then we can estimate $c(X)$ as follows.
$$c(X) = \left(\frac{0.144}{\delta_X}\right)^2. \tag{5}$$
For the previously studied datasets, this formula provides an estimate of $c \sim 0.33$. In our experiments, we found that this value works quite well; however, we found that sometimes adjusting this value (e.g., to $0.05$) provides better results, probably because the image representations computed by deep CNNs pretrained on ImageNet may not have been entirely accurate.
4 Hyperbolic operations
Hyperbolic spaces are not vector spaces in a traditional sense; one cannot use standard operations such as summation, multiplication, etc. To remedy this problem, one can utilize the formalism of Möbius gyrovector spaces, which allows many standard operations to be generalized to hyperbolic spaces. Recently proposed hyperbolic neural networks adopt this formalism to define the hyperbolic versions of feed-forward networks, multinomial logistic regression, and recurrent neural networks [11]. In Appendix A, we discuss these networks and layers in detail, and in this section, we briefly summarize various operations available in the hyperbolic space. Similarly to the paper [11], we use an additional hyperparameter $c$ which modifies the curvature of the Poincaré ball; it is then defined as $\mathbb{D}^n_c = \{\mathbf{x} \in \mathbb{R}^n : c\|\mathbf{x}\|^2 < 1, c \geq 0\}$. The corresponding conformal factor now takes the form $\lambda^c_{\mathbf{x}} = \frac{2}{1 - c\|\mathbf{x}\|^2}$. In practice, the choice of $c$ allows one to balance between hyperbolic and Euclidean geometries, which is made precise by noting that with $c \to 0$, all the formulas discussed below take their usual Euclidean form. The following operations are the main building blocks of hyperbolic networks.
Möbius addition.
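For a pair $\mathbf{x}, \mathbf{y} \in \mathbb{D}^n_c$, the Möbius addition is defined as follows:
$$\mathbf{x} \oplus_c \mathbf{y} = \frac{(1 + 2c\langle \mathbf{x}, \mathbf{y}\rangle + c\|\mathbf{y}\|^2)\,\mathbf{x} + (1 - c\|\mathbf{x}\|^2)\,\mathbf{y}}{1 + 2c\langle \mathbf{x}, \mathbf{y}\rangle + c^2\|\mathbf{x}\|^2\|\mathbf{y}\|^2}. \tag{6}$$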
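As a hedged illustration, Möbius addition can be written in plain Python as below. The sketch assumes the standard gyrovector formula from the hyperbolic neural network literature [11], with numerator $(1 + 2c\langle \mathbf{x}, \mathbf{y}\rangle + c\|\mathbf{y}\|^2)\mathbf{x} + (1 - c\|\mathbf{x}\|^2)\mathbf{y}$ and denominator $1 + 2c\langle \mathbf{x}, \mathbf{y}\rangle + c^2\|\mathbf{x}\|^2\|\mathbf{y}\|^2$; the name `mobius_add` is illustrative.

```python
import math


def mobius_add(x, y, c=1.0):
    """Mobius addition of two points in the Poincare ball of curvature parameter c."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    coef_x = 1.0 + 2.0 * c * xy + c * y2
    coef_y = 1.0 - c * x2
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]


x = [0.3, 0.1]
y = [-0.2, 0.4]
print(mobius_add(x, y, c=1.0))  # hyperbolic "sum", stays inside the unit ball
print(mobius_add(x, y, c=0.0))  # with c = 0 the formula reduces to x + y
```

Note that the operation is neither commutative nor associative in general, so the order of the arguments matters when it is used inside a layer.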
Distance.
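The induced distance function is defined as
$$d_c(\mathbf{x}, \mathbf{y}) = \frac{2}{\sqrt{c}} \operatorname{arctanh}\big(\sqrt{c}\,\|{-\mathbf{x}} \oplus_c \mathbf{y}\|\big). \tag{7}$$
Note that with $c = 1$ one recovers the geodesic distance (1), while with $c \to 0$ we obtain the Euclidean distance $\lim_{c \to 0} d_c(\mathbf{x}, \mathbf{y}) = 2\|\mathbf{x} - \mathbf{y}\|$.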
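Both limiting cases can be checked numerically. The sketch below is a plain-Python illustration that assumes the form $d_c(\mathbf{x}, \mathbf{y}) = \frac{2}{\sqrt{c}} \operatorname{arctanh}(\sqrt{c}\,\|{-\mathbf{x}} \oplus_c \mathbf{y}\|)$ together with the standard Möbius addition formula; the function names are illustrative.

```python
import math


def mobius_add(x, y, c):
    """Mobius addition in the Poincare ball with curvature parameter c."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    coef_x = 1.0 + 2.0 * c * xy + c * y2
    coef_y = 1.0 - c * x2
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]


def hyperbolic_distance(x, y, c=1.0):
    """Distance induced by Mobius addition; requires c > 0."""
    diff = mobius_add([-a for a in x], y, c)
    norm = math.sqrt(sum(d * d for d in diff))
    sqrt_c = math.sqrt(c)
    return 2.0 / sqrt_c * math.atanh(sqrt_c * norm)


x = [0.3, 0.1]
y = [-0.2, 0.4]

# c = 1: agrees with the arccosh form of the unit-ball geodesic distance.
sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
x2 = sum(a * a for a in x)
y2 = sum(b * b for b in y)
reference = math.acosh(1.0 + 2.0 * sq_dist / ((1.0 - x2) * (1.0 - y2)))
print(hyperbolic_distance(x, y, c=1.0), reference)

# c -> 0: approaches twice the Euclidean distance.
print(hyperbolic_distance(x, y, c=1e-8), 2.0 * math.sqrt(sq_dist))
```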
Exponential and logarithmic maps.
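To perform operations in the hyperbolic space, one first needs to define a bijective map from $\mathbb{R}^n$ to $\mathbb{D}^n_c$ in order to map Euclidean vectors to the hyperbolic space, and vice versa. The so-called exponential map and its inverse, the logarithmic map, serve as such a bijection.
The exponential map $\exp^c_{\mathbf{x}}$ is a function from $T_{\mathbf{x}}\mathbb{D}^n_c \cong \mathbb{R}^n$ to $\mathbb{D}^n_c$, which is given by
$$\exp^c_{\mathbf{x}}(\mathbf{v}) = \mathbf{x} \oplus_c \left(\tanh\left(\sqrt{c}\,\frac{\lambda^c_{\mathbf{x}}\|\mathbf{v}\|}{2}\right)\frac{\mathbf{v}}{\sqrt{c}\,\|\mathbf{v}\|}\right). \tag{8}$$
The inverse logarithmic map is defined as
$$\log^c_{\mathbf{x}}(\mathbf{y}) = \frac{2}{\sqrt{c}\,\lambda^c_{\mathbf{x}}} \operatorname{arctanh}\big(\sqrt{c}\,\|{-\mathbf{x}} \oplus_c \mathbf{y}\|\big)\frac{-\mathbf{x} \oplus_c \mathbf{y}}{\|{-\mathbf{x}} \oplus_c \mathbf{y}\|}. \tag{9}$$
In practice, we use the maps $\exp^c_{\mathbf{0}}$ and $\log^c_{\mathbf{0}}$ for a transition between the Euclidean and Poincaré ball representations of a vector.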
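At the origin the two maps simplify considerably, because the conformal factor equals 2 and Möbius addition with the zero vector is the identity. The following plain-Python sketch assumes these simplified forms, $\exp^c_{\mathbf{0}}(\mathbf{v}) = \tanh(\sqrt{c}\|\mathbf{v}\|)\,\mathbf{v}/(\sqrt{c}\|\mathbf{v}\|)$ and $\log^c_{\mathbf{0}}(\mathbf{y}) = \operatorname{arctanh}(\sqrt{c}\|\mathbf{y}\|)\,\mathbf{y}/(\sqrt{c}\|\mathbf{y}\|)$; the names `expmap0` and `logmap0` are illustrative.

```python
import math


def expmap0(v, c=1.0):
    """Map a Euclidean (tangent) vector at the origin into the Poincare ball."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        return [0.0] * len(v)
    sqrt_c = math.sqrt(c)
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * a for a in v]


def logmap0(y, c=1.0):
    """Inverse of expmap0: map a point of the Poincare ball back to the tangent space."""
    norm = math.sqrt(sum(a * a for a in y))
    if norm == 0.0:
        return [0.0] * len(y)
    sqrt_c = math.sqrt(c)
    scale = math.atanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * a for a in y]


v = [0.3, -1.2]
point = expmap0(v, c=0.5)
print(point)                  # lies strictly inside the ball of radius 1 / sqrt(c)
print(logmap0(point, c=0.5))  # recovers v up to floating-point error
```

In a hybrid network of the kind described in this paper, a map like `expmap0` would be applied to the output of the Euclidean backbone before the hyperbolic layers.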
Hyperbolic averaging.
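One important operation common in image processing is averaging of feature vectors, used, e.g., in prototypical networks for few-shot learning [43]. In the Euclidean setting this operation takes the form $(\mathbf{x}_1, \ldots, \mathbf{x}_N) \to \frac{1}{N}\sum_i \mathbf{x}_i$. The extension of this operation to hyperbolic spaces is called the Einstein midpoint and takes its simplest form in Klein coordinates:
$$\operatorname{HypAve}(\mathbf{x}_1, \ldots, \mathbf{x}_N) = \sum_{i=1}^{N} \gamma_i \mathbf{x}_i \Big/ \sum_{i=1}^{N} \gamma_i, \tag{10}$$
where $\gamma_i = \frac{1}{\sqrt{1 - c\|\mathbf{x}_i\|^2}}$ are the Lorentz factors. Recall from the discussion in Section 3 that the Klein model is supported on the same space as the Poincaré ball; however, the same point has different coordinate representations in these models. Let $\mathbf{x}_{\mathbb{D}}$ and $\mathbf{x}_{\mathbb{K}}$ denote the coordinates of the same point in the Poincaré and Klein models correspondingly. Then the following transition formulas hold.
$$\mathbf{x}_{\mathbb{D}} = \frac{\mathbf{x}_{\mathbb{K}}}{1 + \sqrt{1 - c\|\mathbf{x}_{\mathbb{K}}\|^2}}, \tag{11}$$
$$\mathbf{x}_{\mathbb{K}} = \frac{2\,\mathbf{x}_{\mathbb{D}}}{1 + c\|\mathbf{x}_{\mathbb{D}}\|^2}. \tag{12}$$
Thus, given points in the Poincaré ball, we can first map them to the Klein model, compute the average using Equation (10), and then move it back to the Poincaré model.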
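The three steps (Poincaré to Klein, Lorentz-weighted mean, Klein to Poincaré) can be sketched as follows. This plain-Python illustration assumes the standard transition formulas $\mathbf{x}_{\mathbb{K}} = 2\mathbf{x}_{\mathbb{D}}/(1 + c\|\mathbf{x}_{\mathbb{D}}\|^2)$ and $\mathbf{x}_{\mathbb{D}} = \mathbf{x}_{\mathbb{K}}/(1 + \sqrt{1 - c\|\mathbf{x}_{\mathbb{K}}\|^2})$ and Lorentz factors $\gamma_i = 1/\sqrt{1 - c\|\mathbf{x}_i\|^2}$ in Klein coordinates; all function names are illustrative.

```python
import math


def poincare_to_klein(p, c=1.0):
    """Convert Poincare-ball coordinates to Klein coordinates."""
    scale = 2.0 / (1.0 + c * sum(a * a for a in p))
    return [scale * a for a in p]


def klein_to_poincare(k, c=1.0):
    """Convert Klein coordinates back to Poincare-ball coordinates."""
    scale = 1.0 / (1.0 + math.sqrt(1.0 - c * sum(a * a for a in k)))
    return [scale * a for a in k]


def hyperbolic_average(points, c=1.0):
    """Einstein midpoint of Poincare-ball points, computed in Klein coordinates."""
    klein = [poincare_to_klein(p, c) for p in points]
    # Lorentz factors weight points near the boundary more heavily.
    gammas = [1.0 / math.sqrt(1.0 - c * sum(a * a for a in k)) for k in klein]
    total = sum(gammas)
    dim = len(points[0])
    midpoint = [
        sum(g * k[d] for g, k in zip(gammas, klein)) / total for d in range(dim)
    ]
    return klein_to_poincare(midpoint, c)


support = [[0.9, 0.0], [0.0, 0.0]]
print(hyperbolic_average(support))  # pulled towards the boundary point
print([0.45, 0.0])                  # Euclidean mean, for comparison
```

For two points the result coincides with the geodesic midpoint, which is why the average of the origin and a point near the boundary lies further out than the Euclidean mean.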
Numerical stability.
While implementing most of the formulas described above is straightforward, we employ some tricks to make the training more stable. In particular, to ensure numerical stability, we perform clipping by norm after applying the exponential map, which constrains the norm not to exceed $\frac{1}{\sqrt{c}}(1 - 10^{-3})$.
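A minimal sketch of such norm clipping is given below. It assumes the ball of radius $1/\sqrt{c}$ and rescales any vector whose norm exceeds a slightly smaller radius; the function name `clip_to_ball` and the margin parameter `eps` (with an assumed default of `1e-3`) are illustrative choices for this example.

```python
import math


def clip_to_ball(x, c=1.0, eps=1e-3):
    """Rescale x so that its norm does not exceed (1 - eps) / sqrt(c).

    eps is an assumed safety margin that keeps points away from the
    boundary, where arctanh and the conformal factor blow up.
    """
    max_norm = (1.0 - eps) / math.sqrt(c)
    norm = math.sqrt(sum(a * a for a in x))
    if norm <= max_norm:
        return list(x)
    scale = max_norm / norm
    return [scale * a for a in x]


print(clip_to_ball([0.2, 0.1]))  # already inside: returned unchanged
print(clip_to_ball([2.0, 0.0]))  # rescaled onto the clipping radius
```

Clipping only changes the norm and preserves the direction of the embedding, so the angular information produced by the backbone is left intact.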
5 Experiments
Experimental setup.
We start with a toy experiment supporting our hypothesis that the distance to the center of the Poincaré ball indicates model uncertainty. To do so, we first train a classifier in hyperbolic space on the MNIST dataset [21] and evaluate it on the Omniglot dataset [20]. We then investigate and compare the obtained distributions of distances to the origin of hyperbolic embeddings of the MNIST and Omniglot test sets.
In our further experiments, we concentrate on the few-shot classification and person re-identification tasks. The experiments on the Omniglot dataset serve as a starting point, and then we move towards more complex datasets. Afterwards, we consider two datasets, namely MiniImageNet [35] and Caltech-UCSD Birds-200-2011 (CUB) [54]. Finally, we provide the re-identification results for two popular datasets: Market-1501 [61] and DukeMTMC-reID [36, 62]. Further in this section, we provide a thorough description of each experiment. Our code is available on GitHub: https://github.com/leymir/hyperbolic-image-embeddings.
5.1 Distance to the origin as the measure of uncertainty
In this subsection, we validate our hypothesis, which claims that if one trains a hyperbolic classifier, then the distance of the Poincaré ball embedding of an image to the origin can serve as a good measure of confidence of a model. We start by training a simple hyperbolic convolutional neural network on the MNIST dataset (we hypothesized that such a simple dataset contains a very basic hierarchy, roughly corresponding to visual ambiguity of images, as demonstrated by a trained network in Figure 1). The output of the last hidden layer was mapped to the Poincaré ball using the exponential map (8) and was followed by the hyperbolic multiclass logistic regression (MLR) layer [11].
After training the model to $\sim 99\%$ test accuracy, we evaluate it on the Omniglot dataset (by resizing its images to $28 \times 28$ and normalizing them to have the same background color as MNIST). We then evaluated the hyperbolic distance to the origin of embeddings produced by the network on both datasets. The closest Euclidean analogue to this approach would be comparing distributions of $p_{\max}$, the maximum class probability predicted by the network. For the same range of dimensions, we train ordinary Euclidean classifiers on MNIST and compare these distributions for the same sets. Our findings are summarized in Figure 5 and Table 3. We observe that distances to the origin represent a better indicator of the dataset dissimilarity in three out of four cases.
We have visualized the learned MNIST and Omniglot embeddings in Figure 1. We observe that more “unclear” images are located near the center, while the images that are easy to classify are located closer to the boundary.
5.2 Few–shot classification
We hypothesize that a certain class of problems, namely the few-shot classification task, can benefit from hyperbolic embeddings, due to the ability of hyperbolic space to accurately reflect even very complex hierarchical relations between data points. In principle, any metric learning approach can be modified to incorporate hyperbolic embeddings. We decided to focus on the classical approach called prototypical networks (ProtoNets) introduced in [43]. This approach was picked because it is simple in general and simple to convert to hyperbolic geometry. ProtoNets use the so-called prototype representation of a class, which is defined as the mean of the embedded support set of a class. Generalizing this concept to hyperbolic space, we substitute the Euclidean mean operation by $\operatorname{HypAve}$, defined earlier in (10). We show that Hyperbolic ProtoNets can achieve results competitive with many recent state-of-the-art models. Our main experiments are conducted on MiniImageNet and Caltech-UCSD Birds-200-2011 (CUB). Additional experiments on the Omniglot dataset, as well as the implementation details and hyperparameters, are provided in Appendix B. For a visualization of learned embeddings see Appendix C.
MiniImageNet.
The MiniImageNet dataset is the subset of the ImageNet dataset [37] that contains $100$ classes represented by $600$ examples per class. We use the following split provided in the paper [35]: the training dataset consists of $64$ classes, the validation dataset is represented by $16$ classes, and the remaining $20$ classes serve as the test dataset. We test the models on tasks for 1-shot and 5-shot classifications; the number of query points in each batch always equals $15$. Similarly to [43], the model is trained in the 30-shot regime for the 1-shot task and the 20-shot regime for the 5-shot task. We test our approach with two different backbone CNN models: a commonly used four-block CNN [43, 4] (denoted ‘4 Conv’ in the table) and ResNet18 [14]. To find the best values of hyperparameters, we used grid search; see Appendix B for the complete list of values.
Table 4 illustrates the obtained results on the MiniImageNet dataset (alongside other results in the literature). Interestingly, Hyperbolic ProtoNet significantly improves accuracy as compared to the standard ProtoNet, especially in the one-shot setting. We observe that the obtained accuracy values, in many cases, exceed the results obtained by more advanced methods, sometimes even in the case of architecture of larger capacity. This partly confirms our hypothesis that hyperbolic geometry indeed allows for more accurate embeddings in the few–shot setting.
Caltech-UCSD Birds.
The CUB dataset consists of $11{,}788$ images of $200$ bird species and was designed for fine-grained classification. We use the split introduced in [51]: $100$ classes out of $200$ were used for training, $50$ for validation and $50$ for testing. Due to the relative simplicity of the dataset, we consider only the 4-Conv backbone and do not modify the training shot values as was done for the MiniImageNet case. The full list of hyperparameters is provided in Appendix B.
Our findings are summarized in Table 5. Interestingly, for this dataset, the hyperbolic version of ProtoNet significantly outperforms its Euclidean counterpart (by more than 10 percentage points in both settings), and outperforms many other algorithms.
5.3 Person re-identification
The DukeMTMC-reID dataset [36, 62] contains $16{,}522$ training images of $702$ identities, $2{,}228$ query images of $702$ identities and $17{,}661$ gallery images. The Market-1501 dataset [61] contains $12{,}936$ training images of $751$ identities, $3{,}368$ queries of $750$ identities and $15{,}913$ gallery images, respectively. We report Rank-1 of the Cumulative Matching Characteristic curve and mean average precision (mAP) for both datasets. The results (Table 6) are reported after $300$ training epochs. The experiments were performed with the ResNet50 backbone and two different learning rate schedulers (see Appendix B for more details). The hyperbolic version generally performs better than the Euclidean baseline, with the advantage being bigger for smaller dimensionality.
6 Discussion and conclusion
We have investigated the use of hyperbolic spaces for image embeddings. The models that we have considered use Euclidean operations in most layers, and use the exponential map to move from the Euclidean to hyperbolic spaces at the end of the network (akin to the normalization layers that are used to map from the Euclidean space to Euclidean spheres). The approach that we investigate here is thus compatible with existing backbone networks trained in Euclidean geometry.
At the same time, we have shown that across a number of tasks, in particular in the few-shot image classification, learning hyperbolic embeddings can result in a substantial boost in accuracy. We speculate that the negative curvature of the hyperbolic spaces allows for embeddings that are better conforming to the intrinsic geometry of at least some image manifolds with their hierarchical structure.
Future work may include several potential modifications of the approach. We have observed that the benefit of hyperbolic embeddings may be substantially bigger in some tasks and datasets than in others. A better understanding of when and why the use of hyperbolic geometry is warranted is therefore needed. Finally, we note that while all hyperbolic geometry models are equivalent in the continuous setting, fixed-precision arithmetic used in real computers breaks this equivalence. In practice, we observed that care should be taken about numeric precision effects. Using other models of hyperbolic geometry may result in a more favourable floating point performance.
Acknowledgements
This work was funded by the Ministry of Science and Education of Russian Federation as a part of Mega Grant Research Project 14.756.31.000.
Appendix A Hyperbolic Neural Networks
Linear layer.
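Assume we have a standard (Euclidean) linear layer $x \mapsto Mx + b$. In order to generalize it, one needs to define the Möbius matrix-by-vector product:
$$M^{\otimes_c}(x) := \frac{1}{\sqrt{c}} \tanh\left(\frac{\|Mx\|}{\|x\|} \operatorname{arctanh}\left(\sqrt{c}\,\|x\|\right)\right) \frac{Mx}{\|Mx\|},$$
if $Mx \neq 0$, and $M^{\otimes_c}(x) := 0$ otherwise. Finally, for a bias vector $b \in \mathbb{D}^n_c$ the operation underlying the hyperbolic linear layer is then given by $M^{\otimes_c}(x) \oplus_c b$.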
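A minimal NumPy sketch of such a layer is given below, assuming the standard Poincaré-ball definitions of Möbius addition and the Möbius matrix-by-vector product with curvature parameter `c`. The function names and the example weights are hypothetical and serve only to illustrate the construction.

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Mobius addition of two points x and y of the Poincare ball."""
    xy = np.dot(x, y)
    x2 = np.dot(x, x)
    y2 = np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den

def mobius_matvec(M, x, c=1.0):
    """Mobius matrix-by-vector product; returns the origin when Mx = 0."""
    Mx = M @ x
    Mx_norm = np.linalg.norm(Mx)
    if Mx_norm == 0.0:
        return np.zeros_like(Mx)
    x_norm = np.linalg.norm(x)  # nonzero here, since Mx != 0
    sqrt_c = np.sqrt(c)
    scale = np.tanh(Mx_norm / x_norm * np.arctanh(sqrt_c * x_norm)) / sqrt_c
    return scale * Mx / Mx_norm

def hyperbolic_linear(M, b, x, c=1.0):
    """Hyperbolic analogue of x -> Mx + b for a bias b inside the ball."""
    return mobius_add(mobius_matvec(M, x, c), b, c)

# Hypothetical weights and input, for illustration only.
M = np.array([[1.0, -2.0], [0.5, 0.3], [2.0, 1.0]])
b = np.array([0.05, 0.0, -0.05])
x = np.array([0.1, 0.2])
out = hyperbolic_linear(M, b, x)
print(out, np.linalg.norm(out))
```

With the identity matrix and a zero bias the layer returns its input unchanged, which is a quick sanity check on an implementation.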
Concatenation of input vectors.
In several architectures (e.g., in siamese networks), two vectors need to be concatenated; this operation is straightforward in Euclidean space. However, the naive concatenation of two vectors from hyperbolic space does not necessarily remain in hyperbolic space. Thus, we have to use a generalized version of the concatenation operation. For $x \in \mathbb{D}^{n_1}_c$ and $y \in \mathbb{D}^{n_2}_c$, we define the mapping $\mathrm{Concat}: \mathbb{D}^{n_1}_c \times \mathbb{D}^{n_2}_c \to \mathbb{D}^{n_3}_c$ as follows.
$$\mathrm{Concat}(x, y) = M_1^{\otimes_c} x \oplus_c M_2^{\otimes_c} y, \tag{14}$$
where $M_1$ and $M_2$ are trainable matrices of sizes $n_3 \times n_1$ and $n_3 \times n_2$ correspondingly. The motivation for this definition is simple: usually, the Euclidean concatenation layer is followed by a linear map, which when written explicitly takes the (Euclidean) form of Equation (14).
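To make the motivation explicit, consider a Euclidean concatenation followed by a linear map. Writing the weight matrix of that map as two column blocks, $M = [\,M_1 \;\; M_2\,]$ (notation introduced here only for illustration), the composition splits into a sum of two matrix-vector products:

$$M \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} M_1 & M_2 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = M_1 x + M_2 y.$$

The hyperbolic version is obtained by replacing each matrix-vector product with its Möbius counterpart and the sum with Möbius addition.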
Multiclass logistic regression (MLR).
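In our experiments, to perform multiclass classification, we take advantage of the generalization of multiclass logistic regression to hyperbolic spaces. The idea of this generalization is based on the observation that in Euclidean space logits can be represented as the distances to certain hyperplanes, where each hyperplane can be specified with a point of origin and a normal vector. The same construction can be used in the Poincaré ball after a suitable analogue for hyperplanes is introduced. Given $p \in \mathbb{D}^n_c$ and $a \in T_p\mathbb{D}^n_c \setminus \{0\}$, such an analogue would be the union of all geodesics passing through $p$ and orthogonal to $a$.
The resulting formula for hyperbolic MLR for $K$ classes is written below; here $p_k \in \mathbb{D}^n_c$ and $a_k \in T_{p_k}\mathbb{D}^n_c \setminus \{0\}$ are learnable parameters.
$$p(y = k \mid x) \propto \exp\left(\frac{\lambda^c_{p_k} \|a_k\|}{\sqrt{c}} \operatorname{arcsinh}\left(\frac{2\sqrt{c}\,\langle -p_k \oplus_c x,\, a_k\rangle}{\left(1 - c\,\|{-p_k} \oplus_c x\|^2\right)\|a_k\|}\right)\right),$$
where $\lambda^c_{p} = 2 / (1 - c\|p\|^2)$ denotes the conformal factor of the Poincaré ball.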
For a more thorough discussion of hyperbolic neural networks, we refer the reader to the paper [11].
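The hyperbolic MLR construction can be sketched in NumPy as follows, assuming the standard Poincaré-ball formulas for Möbius addition and the conformal factor. The function names, the loop-based layout and the example parameters are hypothetical choices for this illustration rather than the implementation used in the experiments.

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Mobius addition of two points x and y of the Poincare ball."""
    xy = np.dot(x, y)
    x2 = np.dot(x, x)
    y2 = np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den

def hyperbolic_mlr_logits(x, P, A, c=1.0):
    """Unnormalised log-probabilities of hyperbolic MLR for one point x.

    P[k] is the hyperplane offset p_k (a point of the ball) and
    A[k] is the corresponding normal vector a_k.
    """
    sqrt_c = np.sqrt(c)
    logits = []
    for p_k, a_k in zip(P, A):
        z = mobius_add(-p_k, x, c)                # translate x by -p_k
        a_norm = np.linalg.norm(a_k)
        lam = 2.0 / (1.0 - c * np.dot(p_k, p_k))  # conformal factor at p_k
        arg = 2.0 * sqrt_c * np.dot(z, a_k) / ((1.0 - c * np.dot(z, z)) * a_norm)
        logits.append(lam * a_norm / sqrt_c * np.arcsinh(arg))
    return np.array(logits)

def softmax(v):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(v - v.max())
    return e / e.sum()

# Hypothetical parameters for two classes in a 2-D ball.
P = np.array([[0.0, 0.0], [0.2, 0.1]])
A = np.array([[1.0, 0.0], [0.0, 1.0]])
x = np.array([0.5, 0.0])
probs = softmax(hyperbolic_mlr_logits(x, P, A))
print(probs)
```

A point lying on a class hyperplane (for instance `x = p_k`) receives a logit of zero for that class, mirroring the Euclidean picture of signed distances to hyperplanes.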
Appendix B Experiment details
Omniglot.
As a baseline model, we consider the prototype network (ProtoNet). Each convolutional block consists of convolutional layer followed by batch normalization, ReLU nonlinearity and max-pooling layer. The number of filters in the last convolutional layer corresponds to the value of the embedding dimension, for which we choose . The hyperbolic model differs from the baseline in the following aspects. First, the output of the last convolutional block is embedded into the Poincaré ball of dimension using the exponential map. Results are presented in Table 7. We can see that in some scenarios, in particular for one-shot learning, hyperbolic embeddings are more beneficial, while in other cases, results are slightly worse. The relative simplicity of this dataset may explain why we have not observed a significant benefit of hyperbolic embeddings. We further test our approach on more advanced datasets.
miniImageNet.
We performed the experiments with two different backbones, namely the previously discussed 4-Conv model and ResNet18. For the former, the embedding dimension was set to 1024 and for the latter to 512. For the one-shot setting, both models were trained for epochs with the Adam optimizer, learning rate being and step learning rate decay with the factor of and step size being epochs. For the 4-Conv model we used and for ResNet18 we used . For 4-Conv in the five-shot setting we used the same hyperparameters except for and learning rate decay step being epochs. For ResNet18 we additionally changed learning rate to and step size to .
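For readers unfamiliar with step learning rate decay, the rule can be sketched in a few lines. The function name and all numeric values below are placeholders chosen for illustration and are not the settings used in these experiments; the closed form matches the usual behaviour of step schedulers such as PyTorch's `StepLR`.

```python
def step_decay_lr(initial_lr, gamma, step_size, epoch):
    """Learning rate at a given epoch: multiply by gamma every step_size epochs."""
    return initial_lr * gamma ** (epoch // step_size)

# Placeholder values for illustration only; NOT the paper's hyperparameters.
schedule = [step_decay_lr(1e-3, 0.5, 10, epoch) for epoch in (0, 9, 10, 25)]
print(schedule)
```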
Caltech-UCSD Birds.
For these experiments we used the same 4-Conv architecture with the embedding dimensionality being . For the one-shot task, we used learning rate , , learning rate step being epochs and decay rate of . For the five-shot task, we used learning rate , , learning rate step of and decay rate of .
Person re-identification.
We use the ResNet50 [14] architecture with one fully connected embedding layer following the global average pooling. Three embedding dimensionalities are used in our experiments: , and . For the baseline experiments, we add an additional linear classification layer, followed by the cross-entropy loss. For the hyperbolic version of the experiments, we map the descriptors to the Poincaré ball and apply multiclass logistic regression as described in Section 4. We found that in both cases the results are very sensitive to the learning rate schedule. We tried four schedules for learning -dimensional descriptors for both the baseline and hyperbolic versions. The two best-performing schedules were applied for the and -dimensional descriptors. In these experiments, we also found that smaller values of the curvature parameter $c$ give better results. We therefore have set $c$ to . Based on the discussion in Section 4, our hyperbolic setting is quite close to Euclidean. The results are compiled in Table 6. We set starting learning rates to and for and correspondingly and multiply them by after each of the epochs and .
Appendix C Visualizations
For the visual inspection of embeddings, we computed projections of the high-dimensional embeddings obtained from the trained few-shot models with the (hyperbolic) UMAP algorithm [26] (see Figure 6). We observe that different classes are neatly positioned near the boundary of the circle and are well separated.
References
- [1] Ejaz Ahmed, Michael J. Jones, and Tim K. Marks. An improved deep learning architecture for person re-identification. In Conf. Computer Vision and Pattern Recognition, CVPR, pages 3908–3916, 2015.
- [2] Matthias Bauer, Mateo Rojas-Carulla, Jakub Bartlomiej Swiatkowski, Bernhard Schölkopf, and Richard E. Turner. Discriminative k-shot learning using probabilistic models. arXiv preprint arXiv:1706.00326, 2017.
- [3] Xiaobin Chang, Timothy M. Hospedales, and Tao Xiang. Multi-level factorisation net for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2109–2118, 2018.
- [4] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. In ICLR, 2019.
- [5] Zitian Chen, Yanwei Fu, Kaiyu Chen, and Yu-Gang Jiang. Image block augmentation for one-shot learning. In AAAI, 2019.
- [6] Wen-Hsuan Chu, Yu-Jhe Li, Jing-Cheng Chang, and Yu-Chiang Frank Wang. Spot and learn: A maximum-entropy patch sampler for few-shot image classification. In CVPR, pages 6251–6260, 2019.
- [7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
- [8] Bhuwan Dhingra, Christopher J. Shallue, Mohammad Norouzi, Andrew M. Dai, and George E. Dahl. Embedding text in hyperbolic spaces. arXiv preprint arXiv:1806.04313, 2018.
