Asymmetric Feature Maps with Application to Sketch Based Retrieval
Giorgos Tolias, Ondřej Chum

TL;DR
This paper introduces asymmetric feature maps (AFM) for efficient, scale and translation invariant sketch-based image retrieval, enabling multiple kernel evaluations without extra memory and providing query localization.
Contribution
The paper presents a novel AFM approach that improves retrieval efficiency and accuracy, including a new image representation and a faster approximation method for translation search.
Findings
Achieves an order of magnitude speed-up over traditional methods
Outperforms state-of-the-art on standard benchmarks
Provides query localization in retrieved images
| Method | Dim | Time | DB | P@5 | P@10 | P@25 | P@50 |
|---|---|---|---|---|---|---|---|
| (1.2M) | (8,3) | 55.4 | 15.3 | 43.2 | 40.9 | 37.2 | 33.8 |
| (1.2M) | (5,2) | 20.2 | 3.3 | 25.8 | 24.7 | 22.5 | 20.2 |
| (1.2M) | (8,3) | 55.4 | 5.1 | 50.1 | 46.7 | 42.0 | 37.2 |
| (1.2M) | (5,2) | 20.2 | 1.1 | 45.8 | 44.1 | 38.5 | 35.4 |
| (50k) | (6,3) | 3.5 | 2.8 | 49.7 | 47.4 | 41.3 | 36.8 |
| (50k) | (6,3) | 2.5 | 2.8 | 49.6 | 47.3 | 41.0 | 36.6 |
| (50k)† | (6,3) | 2.5 | 0.7 | 50.3 | 47.3 | 41.5 | 36.7 |
| (50k) | (5,2) | 2.5 | 1.1 | 45.8 | 44.2 | 38.4 | 35.3 |
| (50k) | (5,2) | 1.7 | 1.1 | 45.7 | 44.2 | 38.3 | 35.1 |
| (50k)† | (5,2) | 1.7 | 0.3 | 45.6 | 43.5 | 38.0 | 35.0 |
| (50k)†+QE3 | (6,3) | 2.7 | 0.8 | 55.2 | 57.4 | 57.4 | 57.5 |
| (50k)†+QE10 | (6,3) | 2.7 | 0.8 | 63.0 | 63.4 | 64.8 | 65.2 |
| (50k)†+QE3 | (5,2) | 1.9 | 0.4 | 50.9 | 52.2 | 52.5 | 52.4 |
| (50k)†+QE10 | (5,2) | 1.9 | 0.4 | 56.4 | 56.8 | 57.3 | 57.8 |
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications
Asymmetric Feature Maps with Application to Sketch Based Retrieval
Giorgos Tolias Ondřej Chum
Visual Recognition Group, Faculty of Electrical Engineering, Czech Technical University in Prague
{giorgos.tolias,chum}@cmp.felk.cvut.cz
Abstract
We propose a novel concept of asymmetric feature maps (AFM), which allows multiple kernels to be evaluated between a query and database entries without increasing the memory requirements. To demonstrate the advantages of the AFM method, we derive a short vector image representation that, due to asymmetric feature maps, supports efficient scale and translation invariant sketch-based image retrieval. Unlike most short-code based retrieval systems, the proposed method localizes the query in the retrieved image. The efficiency of the search is boosted by approximating a 2D translation search, performed via a trigonometric polynomial of scores, with 1D projections. The projections are a special case of AFM. An order of magnitude speed-up is achieved compared to traditional trigonometric polynomials. The results are boosted by an image-based average query expansion, significantly exceeding the state of the art on standard benchmarks.
1 Introduction
Efficient match kernel [3] is a popular choice in applications evaluating complex similarity measures on large collections of objects, where an object is a set of elements. This includes local feature descriptors [3, 5] and image retrieval with short descriptors [38].
Footnote 1: The authors were supported by the MSMT LL1303 ERC-CZ grant.
In the efficient match kernel, all elements of the sets are mapped to a finite feature map [27, 39]. An inner product of the feature maps approximates the evaluation of a specific kernel, which defines the similarity of the set elements. We propose an extension to this concept. In the asymmetric feature map, the query uses a different embedding than the database objects. The query embedding defines the kernel that is evaluated between the query and the database entries. Thus, multiple kernels can be evaluated while the memory requirements for the database remain the same (up to a scalar per kernel) as for a single kernel. The embeddings are obtained via joint kernel feature map optimization, which significantly improves the quality of kernel approximation for a fixed dimensionality of the feature map.
The application domain of AFM is wide, in particular any method using efficient match kernel benefits from AFM. We evaluate the AFM on a sketch-based retrieval application. Sketch-based retrieval has received less attention than image retrieval and still remains challenging. Instead of a real image, the query consists of an abstract binary sketch. This allows the user to quickly outline an object, e.g. by a finger on a tablet or smart phone, and search for relevant images (see Figure 1). The progress in this area has more or less followed the footsteps of traditional image retrieval. The first systems employed global descriptors [8]. Then, the Bag-of-Words paradigm with local descriptors and feature quantization [17, 16, 30] was adopted.
Due to the absence of textural cues on the query side, the image representations are shape based. Bridging the representation gap between hand-drawn sketches and real images is one of the challenges that make the task difficult. Matching based on shape information has been addressed previously. For instance, in object recognition and detection [2, 17, 23], a costly online matching is performed, which prevents these methods from scaling to large image collections. Recent methods manage to index from millions [7] up to a billion [36] images for sketch-based retrieval, at the cost of sacrificing invariance to geometric transformations.
To demonstrate the impact of the AFM, we propose a short vector image representation that makes it possible to index large image collections for sketch-based search. The scale and translation invariant search runs in real time on the order of millions of images per processor thread. The AFM based method achieves state-of-the-art results on standard benchmarks. The method runs at a speed comparable to previously published approaches tailored to sketch-based search. Compared with methods based on efficient match kernel [38], the proposed method achieves an order of magnitude speed-up. Unlike most of the methods using low-dimensional descriptors, the proposed method delivers localization of the object in both scale and space. The scale invariance is achieved by evaluating multiple kernels without the need to store multiple representations for database images. The translation invariance and object localization are provided by an efficient similarity evaluation on a 2D grid of translations. The four main contributions of this work are as follows. (1) Asymmetric explicit feature maps are proposed, allowing the use of multiple kernel functions without constructing multiple representations for database items. (2) A joint kernel approximation approach for multiple kernels is derived, generalizing a recent approach of low dimensional explicit feature maps (LDFM) [9]. (3) The scoring through trigonometric polynomial introduced in [38] is further extended and a significant speed-up of its evaluation is proposed. (4) State-of-the-art sketch-based image retrieval based on the AFM is achieved and further boosted by query expansion, which acts not on the edge maps, as in standard sketch matching, but on the original images.
The rest of the paper is organized as follows. Related work is discussed in Section 2 and the necessary background is presented in Section 3. Sections 4 and 5 describe our contributions on asymmetric explicit feature maps and on sketch retrieval, respectively, while the retrieval procedure and the experimental evaluation are analyzed in Section 6.
2 Related work
The most similar work to ours is the approach of Tolias et al. [38], where the trigonometric polynomial scores were introduced in the context of image retrieval (see Section 3.3 for technical details). Shape properties of local features, such as dominant orientation or position, are jointly encoded with the SIFT descriptor. Despite initially assuming aligned objects, their kernel descriptor comes with an efficient way to compute similarity over multiple image transformations. Compared to their method, asymmetric feature maps introduced in our paper: i) reduce the memory requirements of multi-scale search by roughly a factor of 3, and ii) achieve an order of magnitude speed-up through approximate translation search. The trigonometric polynomials have also been used by Bursuc et al. [5] in the context of rotation invariant feature descriptors. The descriptor has recently shown results competitive with CNN-based approaches [1].
Since we demonstrate the advantages of AFM on sketch based retrieval, we provide a brief review of relevant literature on this topic. The line of research that focuses on sketches includes recognition [14, 40] or retrieval [24] of sketches. This paper addresses sketch-based image retrieval, which tries to match sketch queries to real images from a large collection. Following successful examples of traditional image retrieval, sketch-based methods employ global image representation [8, 29] or local descriptors and the Bag-of-Words model. In the latter case, representative methods employ local descriptors that are traditionally used on images [16, 30] or proposed particularly for this task [15, 28, 18, 6]. Some examples are HOG descriptors which are adapted for sketch retrieval [18] and were recently extended to capture color [4], symmetry-aware and flip invariant descriptors [6], and descriptors based on local contour fragments [28]. Generic approaches performing learning of discriminative features have been shown effective for sketch retrieval too [33].
Chamfer matching appears to be a good similarity measure for object shapes [37]. Recent attempts focus on Chamfer matching approximations in order to increase scalability. Cao et al. [7] binarize the distance transform map and manage to index two million images. However, their approach completely lacks invariance. The same holds for the work of Sun et al. [36] who increase the scale of the indexed collection up to one billion. Despite the achievement of scalability, rough approximations of Chamfer matching sacrifice accuracy. Recently, Parui and Mittal [25] proposed a similarity invariant approach able to index up to one million images. Their solution is based on dynamic programming to match chains of contour lines, while the main drawback is the costly off-line indexing.
3 Background
We briefly review the necessary background, which includes efficient match kernels [3], explicit feature maps [39] and efficient trigonometric polynomial scores [38].
3.1 Efficient Match Kernels
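In many situations, an object is described by a set of measurements $\mathcal{X}$. Employing a mapping $\Psi$ to the elements of $\mathcal{X}$, the set representation of efficient match kernels is defined as
$$V(\mathcal{X}) = \sum_{x \in \mathcal{X}} \Psi(x). \tag{1}$$
Then, a dot product between the set representations yields the similarity between sets
$$S(\mathcal{X}, \mathcal{Y}) = V(\mathcal{X})^\top V(\mathcal{Y}) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Psi(x)^\top \Psi(y). \tag{2}$$
Normalized similarity is computed by cosine similarity [38], i.e., dot product of normalized vectors,
$$\bar{S}(\mathcal{X}, \mathcal{Y}) = \frac{V(\mathcal{X})^\top V(\mathcal{Y})}{\|V(\mathcal{X})\| \, \|V(\mathcal{Y})\|}, \tag{3}$$
while another choice is to normalize by the set cardinality [3]. Herein, the cosine similarity is adopted, ensuring that self-similarity is normalized to one. A number of image representations, such as BOW [35, 11], Fisher vectors [26], or VLAD [20], can be interpreted as efficient match kernels.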
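As an illustration of the set representation and the cosine-normalized similarity described above, here is a minimal sketch in plain Python. The two-dimensional `toy_map` and the measurement sets `X` and `Y` are assumptions made for the example and do not come from the paper.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def set_representation(measurements, feature_map):
    """Sum the feature maps of all set elements into one vector."""
    vectors = [feature_map(x) for x in measurements]
    return [sum(components) for components in zip(*vectors)]

def cosine_similarity(u, v):
    """Dot product of the L2-normalized vectors."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def toy_map(x):
    # Hypothetical 2-D feature map of a scalar measurement (illustration only).
    return [math.cos(x), math.sin(x)]

X = [0.1, 0.5, 0.9]   # hypothetical measurement sets
Y = [0.2, 0.4]

VX = set_representation(X, toy_map)
VY = set_representation(Y, toy_map)

# One dot product between set vectors equals the sum over all element pairs.
set_similarity = dot(VX, VY)
pairwise_sum = sum(dot(toy_map(x), toy_map(y)) for x in X for y in Y)
```

The point of the construction is that the double sum over element pairs never has to be evaluated explicitly: each set is summarized once, and a single dot product gives the same value.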
3.2 Explicit feature maps
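Let $k$ be a one-dimensional ($x$ is now scalar) positive definite stationary kernel [32], $k: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$. The value of a stationary kernel by definition depends only on the difference $\lambda = x - y$,
$$k(x, y) = K(x - y) = K(\lambda), \tag{4}$$
where $K$ is a signature of kernel $k$. Due to Bochner’s theorem, kernel signature $K$ can be written as
$$K(\lambda) = \int_{0}^{\infty} \alpha(\omega) \cos(\omega \lambda) \, d\omega, \tag{5}$$
where $\alpha(\omega) \geq 0$. The kernel signature is approximated by a sum over a finite set of frequencies $\Omega$
$$K(\lambda) \approx \hat{K}(\lambda) = \sum_{\omega \in \Omega} \alpha_\omega \cos(\omega \lambda), \tag{6}$$
where $\alpha_\omega \geq 0$ is the weight of frequency $\omega$. Applying the trigonometric identity
$$\cos(\omega (x - y)) = \cos(\omega x) \cos(\omega y) + \sin(\omega x) \sin(\omega y) \tag{7}$$
gives rise to feature map (or feature embedding) $\psi_\omega: \mathbb{R} \to \mathbb{R}^2$ defined as
$$\psi_\omega(x) = \sqrt{\alpha_\omega} \, \big( \cos(\omega x), \; \sin(\omega x) \big)^\top. \tag{8}$$
The inner product of two such vectors reconstructs the terms of equation (6) since $\psi_\omega(x)^\top \psi_\omega(y) = \alpha_\omega \cos(\omega (x - y))$. Let the feature map $\Psi(x)$ be constructed as a concatenation of $\psi_\omega(x)$ for all $\omega \in \Omega$. Now, the inner product
$$\Psi(x)^\top \Psi(y) = \sum_{\omega \in \Omega} \alpha_\omega \cos(\omega (x - y)) = \hat{K}(x - y) \tag{9}$$
evaluates the approximation of the kernel signature (6) and hence approximates the original kernel $k$. The choice of the number of frequencies $|\Omega|$ determines the quality of the approximation and the dimensionality of the embedding. The dimensionality is $2|\Omega|$, or $2|\Omega| - 1$ if $0 \in \Omega$.
Footnote 2: If $\omega = 0$, then $\sin(\omega x) = 0$ for all $x$, and this component can be dropped from the explicit feature map.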
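The construction above can be checked numerically. The following sketch, in plain Python, builds an explicit feature map from a set of frequencies and weights and compares the inner product of two embedded scalars with the finite cosine sum it is meant to reproduce. The frequencies and weights are arbitrary assumed values, not an optimized approximation of any particular kernel.

```python
import math

def feature_map(x, freqs, weights):
    """Concatenate sqrt(alpha) * (cos(w x), sin(w x)) over all frequencies.

    The sine component is dropped for w = 0 because it is always zero.
    """
    out = []
    for w, a in zip(freqs, weights):
        out.append(math.sqrt(a) * math.cos(w * x))
        if w != 0:
            out.append(math.sqrt(a) * math.sin(w * x))
    return out

def approx_signature(lam, freqs, weights):
    """Finite cosine sum that approximates the kernel signature."""
    return sum(a * math.cos(w * lam) for w, a in zip(freqs, weights))

freqs = [0.0, 1.0, 2.0]      # hypothetical frequency set
weights = [0.4, 0.4, 0.2]    # hypothetical non-negative weights

x, y = 0.7, -0.3
phi_x = feature_map(x, freqs, weights)
phi_y = feature_map(y, freqs, weights)

# Inner product of the embeddings versus the cosine sum at lambda = x - y.
inner = sum(a * b for a, b in zip(phi_x, phi_y))
target = approx_signature(x - y, freqs, weights)
```

With three frequencies, one of which is zero, the embedding has five components, in line with the dimensionality count given above.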
Feature map construction.
We describe in detail (and compare) two approaches to construct the explicit feature maps. We do not consider random feature maps [27], which approximate the integral in (5) using Monte-Carlo methods. Such feature maps provide a poor approximation when the dimensionality is low.
Vedaldi and Zisserman [39] propose the following approximation to a kernel signature $K(\lambda)$ on an interval $\lambda \in [-\Lambda, \Lambda]$. First, a periodic function $\hat{K}(\lambda)$ with period $2\Lambda$ is constructed, so that $\hat{K}(\lambda) = K(\lambda)$ for $\lambda \in [-\Lambda, \Lambda]$. The feature map is then efficiently obtained by approximating the periodic $\hat{K}$ using harmonic frequencies only. This approach has been shown to be sub-optimal [9]. Further, the periodic function $\hat{K}$ is not even guaranteed to be positive definite.
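A convex optimization approach is proposed by Chum [9]. The input domain of $K$ is discretized to a finite set $\mathcal{D}$. The quality of the approximation is measured at points in $\mathcal{D}$ as, for example, an $L_\infty$ norm
$$C_\infty(K, \hat{K}) = \max_{\lambda \in \mathcal{D}} \big| K(\lambda) - \hat{K}(\lambda) \big|. \tag{10}$$
The set of frequencies $\Omega$ is selected from a pool of frequencies $\Omega_P$, together with the corresponding weights $\alpha_\omega$, jointly through a solution of a linear program
$$\min_{\alpha} \; C_\infty(K, \hat{K}) + \gamma \sum_{\omega \in \Omega_P} \alpha_\omega \quad \text{subject to} \quad \alpha_\omega \geq 0, \tag{11}$$
where $\hat{K}(\lambda) = \sum_{\omega \in \Omega_P} \alpha_\omega \cos(\omega \lambda)$ and $\gamma$ is a weight on the regularizer controlling the trade-off between the quality of the approximation and the sparsity of $\alpha$. The selected set $\Omega$ consists of the frequencies with non-zero weight. This is the method we adopt and extend in this work.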
3.3 Alignment using trigonometric polynomials
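Tolias et al. [38] propose an image representation derived by efficient match kernels and explicit feature maps. We focus on the case that all measurements of set $\mathcal{X}$ are shifted by a constant value $\Delta$; note that measurements $x$ are now scalars. The similarity under such shift forms a trigonometric polynomial
$$S(\mathcal{X}_\Delta, \mathcal{Y}) = \sum_{\omega \in \Omega} \big( a_\omega \cos(\omega \Delta) + b_\omega \sin(\omega \Delta) \big), \tag{12}$$
with $\mathcal{X}_\Delta = \{ x + \Delta : x \in \mathcal{X} \}$. Parameters $a_\omega$ and $b_\omega$ are given by dot products of relevant sub-vectors of the set representations $V(\mathcal{X})$ and $V(\mathcal{Y})$. Finally, the similarity measure that is invariant under such shifting is given by $\max_\Delta S(\mathcal{X}_\Delta, \mathcal{Y})$.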
We postpone further analysis of polynomials of scores until the image representation is introduced in Section 5.
4 Asymmetric feature maps
In this section, we introduce the concept of asymmetric feature maps. Unlike in classical explicit feature maps, the query side and the database side use different feature maps. We show that with asymmetric feature maps, a number of different kernels can be efficiently evaluated between query and database vectors while keeping the database storage of fixed size. Compare the feature map in equation (8) to the following feature maps for the query and database side respectively
$$\psi^{q}_\omega(x) = \alpha_\omega \, \big( \cos(\omega x), \; \sin(\omega x) \big)^\top, \qquad \psi^{d}_\omega(y) = \big( \cos(\omega y), \; \sin(\omega y) \big)^\top. \tag{13}$$
The inner products are preserved, since $\psi^{q}_\omega(x)^\top \psi^{d}_\omega(y) = \alpha_\omega \cos(\omega (x - y))$. The kernel function is fully defined by the weights $\alpha_\omega$ on the query side. No additional storage is required on the database side to evaluate the kernel. The same holds for efficient match kernels, as (1) is, up to normalization, a sum of feature maps. To evaluate the cosine similarity (3), only a single scalar per kernel needs to be stored for each database entry: the norm of its set representation under that kernel, which is computed offline.
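A minimal sketch of the asymmetric idea, in plain Python: the database side stores an unweighted cosine/sine embedding once, while the query side carries the kernel weights, so that two different kernels are evaluated against the same stored vector. The frequency set and the two weight vectors (`weights_wide`, `weights_narrow`) are hypothetical values chosen for illustration.

```python
import math

def database_map(y, freqs):
    """Unweighted embedding, stored once per database item."""
    out = []
    for w in freqs:
        out.append(math.cos(w * y))
        if w != 0:
            out.append(math.sin(w * y))
    return out

def query_map(x, freqs, weights):
    """Query embedding; it carries the full kernel weights alpha."""
    out = []
    for w, a in zip(freqs, weights):
        out.append(a * math.cos(w * x))
        if w != 0:
            out.append(a * math.sin(w * x))
    return out

def kernel_signature(lam, freqs, weights):
    """Cosine sum defining the approximate kernel for the given weights."""
    return sum(a * math.cos(w * lam) for w, a in zip(freqs, weights))

freqs = [0.0, 1.0, 2.0, 3.0]                 # shared hypothetical frequencies
weights_wide = [0.5, 0.3, 0.15, 0.05]        # hypothetical kernel 1
weights_narrow = [0.25, 0.25, 0.25, 0.25]    # hypothetical kernel 2

x, y = 0.4, 1.1
stored = database_map(y, freqs)              # computed and stored only once

score_wide = sum(q * d for q, d in zip(query_map(x, freqs, weights_wide), stored))
score_narrow = sum(q * d for q, d in zip(query_map(x, freqs, weights_narrow), stored))
```

Switching kernels only changes the query vector; the stored database vector is untouched, which is where the memory saving comes from.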
Joint approximation of multiple kernels.
In order to evaluate a number of different kernels using the asymmetric feature maps, all respective explicit feature maps have to be based on the same set of frequencies $\Omega$. A naive approach would be to optimize the set of frequencies for one of the kernels and keep it fixed for other kernels. This approach, however, leads to poor approximation, as shown in Figure 2. We propose an extension to LDFM [9] to jointly approximate a set of kernels represented by their respective kernel signatures $K_j$, $j = 1, \ldots, m$. The quality of the approximation is measured by the sum of individual qualities (10)
$$C_J = \sum_{j=1}^{m} C_\infty(K_j, \hat{K}_j), \tag{14}$$
where $\hat{K}_j$ is the approximation of $K_j$ with its own weights over the shared frequencies. The optimization is performed by executing a linear program
$$\min_{\alpha_1, \ldots, \alpha_m} \; C_J + \gamma \, R(\alpha_1, \ldots, \alpha_m) \quad \text{subject to} \quad \alpha_j \geq 0, \tag{15}$$
where $R$ denotes the sparsity regularizer on the weights and $\gamma$ is a weight of the sparsity regularizer that controls the number of frequencies used, i.e. the dimensionality of the feature map. Following the approach of Chum [9], to ensure the required dimensionality of the feature map, a binary search for $\gamma$ is performed.
Figure 2 presents the approximation of three different kernels using the same set of frequencies. We compare the approximation using only harmonic frequencies, the naive approximation mentioned above, and our joint approximation. The latter has a significantly better fit.
5 Sketch-Based Retrieval
In this section we present our sketch descriptor employing explicit feature maps and elaborate on the efficient trigonometric polynomial of scores to further approximate it. Our methodology is presented for the symmetric feature maps, while the asymmetric case is equivalent. We finally present efficient ways to perform the initial ranking and re-ranking for sketch-based image retrieval.
5.1 Sketch descriptor
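Consider a binary sketch $\mathcal{P}$ as a set of contour points, that is a set of pixels that lie on the contour. A contour pixel is represented as $p = (x_p, y_p, \theta_p, w_p)$, where $x_p$ and $y_p$ are 2D image coordinates, $\theta_p$ is the gradient angle (or orientation) of the contour at $p$, and $w_p$ is a strength of the gradient. For real images, the contour parameters are obtained from an edge detector. For sketches, $w_p = 1$ is set for all contour pixels.
The similarity between contour pixels is computed using a multiplicative kernel composed of three one-dimensional kernels, spatial kernels over $x$, $y$, and an orientation kernel over $\theta$. The 1D stationary kernels are denoted $k_x$, $k_y$, and $k_\theta$ respectively. The sketch descriptor is a weighted sum of contour pixel feature maps
$$V(\mathcal{P}) = \sum_{p \in \mathcal{P}} w_p \, \Psi(x_p) \otimes \Psi(y_p) \otimes \Psi(\theta_p). \tag{16}$$
Footnote 3: We use $\Psi$ to denote both the spatial and orientation feature map and simplify the notation. In fact, $\Psi_x$ and $\Psi_y$ approximate the spatial kernels $k_x$ and $k_y$, respectively, which are identical, while $\Psi_\theta$ approximates the orientation kernel $k_\theta$.
It is easy to show that sketch similarity (2) becomes
$$S(\mathcal{P}, \mathcal{Q}) = \sum_{p \in \mathcal{P}} \sum_{q \in \mathcal{Q}} w_p w_q \, \hat{k}_x(x_p, x_q) \, \hat{k}_y(y_p, y_q) \, \hat{k}_\theta(\theta_p, \theta_q), \tag{17}$$
where $\hat{k}$ denotes the approximation of the corresponding kernel by its feature map.
The orientation and spatial kernels are implemented by 1D RBF kernels with parameters $\sigma_\theta$ and $\sigma_{xy}$, respectively. The sets of frequencies are denoted by $\Omega_\theta$ and $\Omega_{xy}$, while the dimensionality of the corresponding embeddings is $2|\Omega_\theta| - 1$ and $2|\Omega_{xy}| - 1$, respectively. Note that frequency $\omega = 0$ is always included. The sketch descriptor has dimensionality $(2|\Omega_{xy}| - 1)^2 (2|\Omega_\theta| - 1)$.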
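To make the multiplicative structure of the descriptor concrete, the sketch below builds it in plain Python as a Kronecker product of three 1D feature maps per contour pixel, summed over the pixels. It then checks that the dot product of two descriptors equals the double sum of products of the three approximate 1D kernels. The frequency sets, weights, and pixel values are hypothetical, and the weights are not fitted to an RBF kernel.

```python
import math

def feature_map(v, freqs, weights):
    """1D explicit feature map; the sine term is dropped for frequency 0."""
    out = []
    for w, a in zip(freqs, weights):
        out.append(math.sqrt(a) * math.cos(w * v))
        if w != 0:
            out.append(math.sqrt(a) * math.sin(w * v))
    return out

def kron(u, v):
    """Kronecker product of two vectors given as flat lists."""
    return [a * b for a in u for b in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sketch_descriptor(pixels, xy, th):
    """Weighted sum over contour pixels of Psi(x) (x) Psi(y) (x) Psi(theta)."""
    total = None
    for x, y, theta, strength in pixels:
        vec = kron(kron(feature_map(x, *xy), feature_map(y, *xy)),
                   feature_map(theta, *th))
        vec = [strength * c for c in vec]
        total = vec if total is None else [s + c for s, c in zip(total, vec)]
    return total

def kernel(lam, freqs, weights):
    """Approximate 1D kernel signature evaluated at the difference lam."""
    return sum(a * math.cos(w * lam) for w, a in zip(freqs, weights))

xy = ([0.0, 1.0, 2.0], [0.5, 0.3, 0.2])   # hypothetical spatial (freqs, weights)
th = ([0.0, 1.0], [0.6, 0.4])             # hypothetical orientation (freqs, weights)

# Hypothetical contour pixels: (x, y, theta, gradient strength).
sketch_p = [(0.1, 0.2, 0.3, 1.0), (0.5, 0.4, 1.0, 1.0)]
sketch_q = [(0.2, 0.1, 0.5, 1.0)]

VP = sketch_descriptor(sketch_p, xy, th)
VQ = sketch_descriptor(sketch_q, xy, th)
similarity = dot(VP, VQ)

expected = sum(
    p[3] * q[3]
    * kernel(p[0] - q[0], *xy) * kernel(p[1] - q[1], *xy) * kernel(p[2] - q[2], *th)
    for p in sketch_p for q in sketch_q
)
```

With three spatial and two orientation frequencies, the descriptor has 5 × 5 × 3 = 75 components.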
The proposed representation constitutes a holistic representation encoding the global sketch shape. We now define a representation encoding only one of the spatial coordinates along with the orientation. It is equivalent to the projection of contour pixels on the horizontal/vertical image axis. The sketch descriptor derived by projection on the horizontal axis is given by
$$V_x(\mathcal{P}) = \sum_{p \in \mathcal{P}} w_p \, \Psi(x_p) \otimes \psi_0(y_p) \otimes \Psi(\theta_p), \tag{18}$$
where the factor $\psi_0(y_p)$ can be omitted and is only used to show that the $x$-projection is a sub-vector of (16) and hence a special case of the proposed asymmetric feature map. This stems from the presence of the constant component of the feature map for $\omega = 0$, corresponding to $\cos(0 \cdot y_p) = 1$. An analogous derivation holds for $V_y(\mathcal{P})$ and vertical projection.
5.2 Position alignment
The sketch descriptor encodes spatial coordinates and orientation of contour pixels. Therefore, alignment of objects is assumed, i.e. centered and up-right objects. Such an assumption does not hold in real image collections and introduces significant limitations. We now detail the polynomial of scores (mentioned in Section 3) proposed by Tolias et al. [38]. We show that translation invariance is achieved by polynomial of scores, and that its evaluation can be efficiently approximated to speed up the search process.
One dimensional.
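Consider the $x$-projected sketch descriptor $V_x(\mathcal{P})$. Let $\mathcal{P}_\Delta$ be the shifted version of sketch $\mathcal{P}$, where all contour pixels are horizontally translated by $\Delta$. Elementary trigonometric identities allow us to show that
$$\psi^{c}_\omega(x + \Delta) = \psi^{c}_\omega(x) \cos(\omega \Delta) - \psi^{s}_\omega(x) \sin(\omega \Delta), \qquad \psi^{s}_\omega(x + \Delta) = \psi^{s}_\omega(x) \cos(\omega \Delta) + \psi^{c}_\omega(x) \sin(\omega \Delta), \tag{19}$$
where $\psi^{c}_\omega$ and $\psi^{s}_\omega$ denote the first and second dimension of (8), respectively. Let $V^{c}_\omega(\mathcal{P})$ be the sub-vector of $V_x(\mathcal{P})$ comprising all elements that contain the term $\cos(\omega x)$, and similarly $V^{s}_\omega(\mathcal{P})$ for $\sin(\omega x)$. It turns out that the descriptor of the translated sketch is constructed from that of the original sketch
$$V^{c}_\omega(\mathcal{P}_\Delta) = V^{c}_\omega(\mathcal{P}) \cos(\omega \Delta) - V^{s}_\omega(\mathcal{P}) \sin(\omega \Delta), \qquad V^{s}_\omega(\mathcal{P}_\Delta) = V^{s}_\omega(\mathcal{P}) \cos(\omega \Delta) + V^{c}_\omega(\mathcal{P}) \sin(\omega \Delta). \tag{20}$$
The sketch similarity between sketches $\mathcal{P}$ and $\mathcal{Q}$ under horizontal translation is a trigonometric polynomial
$$S_x(\mathcal{P}_\Delta, \mathcal{Q}) = V_x(\mathcal{P}_\Delta)^\top V_x(\mathcal{Q}) = \sum_{\omega \in \Omega_{xy}} \big( a_\omega \cos(\omega \Delta) + b_\omega \sin(\omega \Delta) \big), \tag{21}$$
with coefficients $a_\omega$ and $b_\omega$
$$a_\omega = V^{c}_\omega(\mathcal{P})^\top V^{c}_\omega(\mathcal{Q}) + V^{s}_\omega(\mathcal{P})^\top V^{s}_\omega(\mathcal{Q}), \qquad b_\omega = V^{c}_\omega(\mathcal{P})^\top V^{s}_\omega(\mathcal{Q}) - V^{s}_\omega(\mathcal{P})^\top V^{c}_\omega(\mathcal{Q}). \tag{22}$$
The coefficients $a_\omega$ and $b_\omega$ of this polynomial are each computed by two products of sub-vectors with $2|\Omega_\theta| - 1$ dimensions. In total there are $2|\Omega_{xy}| - 1$ coefficients to be computed, since $b_0 = 0$. Finally, evaluating the similarity for any translation $\Delta$ with (21) has cost equal to $2|\Omega_{xy}| - 1$ scalar multiplications. If the candidate translations are fixed, then terms $\cos(\omega \Delta)$ and $\sin(\omega \Delta)$ can be pre-computed. Normalized similarity comes at no extra cost since the norm of the sketch descriptor remains constant under translations ($k_x$ is a stationary kernel):
$$\bar{S}_x(\mathcal{P}_\Delta, \mathcal{Q}) = \frac{S_x(\mathcal{P}_\Delta, \mathcal{Q})}{\|V_x(\mathcal{P})\| \, \|V_x(\mathcal{Q})\|}. \tag{23}$$
Similarity that is invariant to horizontal translation is computed by maximizing (21) over all possible translations
$$\bar{S}^{\star}_x(\mathcal{P}, \mathcal{Q}) = \max_{\Delta} \bar{S}_x(\mathcal{P}_\Delta, \mathcal{Q}). \tag{24}$$
Note that this similarity is also invariant to vertical translation as coordinate $y$ is not encoded at all. However, this makes the representation less discriminative. The actual sketch transformation aligning the two shapes is given by $\arg\max_\Delta \bar{S}_x(\mathcal{P}_\Delta, \mathcal{Q})$. Similarity based on the vertical projection is defined in a similar way.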
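The one-dimensional translation search described above can be sketched in plain Python as follows. The descriptor is kept as a dictionary that maps each spatial frequency to its cosine and sine sub-vectors, which is an assumed layout chosen for readability rather than the paper's storage format. Frequencies, weights, pixels, and the candidate grid are hypothetical, and gradient strengths are taken as 1.

```python
import math

def orientation_map(theta, freqs, weights):
    """1D explicit feature map of the orientation (sine dropped at frequency 0)."""
    out = []
    for w, a in zip(freqs, weights):
        out.append(math.sqrt(a) * math.cos(w * theta))
        if w != 0:
            out.append(math.sqrt(a) * math.sin(w * theta))
    return out

def projected_descriptor(pixels, freqs_x, weights_x, freqs_t, weights_t):
    """Map each spatial frequency to its (cosine, sine) sub-vectors.

    For frequency 0 the sine sub-vector is all zeros and is kept only for uniformity.
    """
    dim = len(orientation_map(0.0, freqs_t, weights_t))
    desc = {}
    for w, a in zip(freqs_x, weights_x):
        vc, vs = [0.0] * dim, [0.0] * dim
        for x, theta in pixels:
            ft = orientation_map(theta, freqs_t, weights_t)
            c = math.sqrt(a) * math.cos(w * x)
            s = math.sqrt(a) * math.sin(w * x)
            for i, f in enumerate(ft):
                vc[i] += c * f
                vs[i] += s * f
        desc[w] = (vc, vs)
    return desc

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def polynomial_coefficients(dp, dq):
    """Per-frequency coefficients (a, b) of the trigonometric polynomial of scores."""
    coeffs = {}
    for w, (pc, ps) in dp.items():
        qc, qs = dq[w]
        coeffs[w] = (dot(pc, qc) + dot(ps, qs), dot(pc, qs) - dot(ps, qc))
    return coeffs

def score(coeffs, delta):
    """Similarity of the first sketch shifted by delta against the second sketch."""
    return sum(a * math.cos(w * delta) + b * math.sin(w * delta)
               for w, (a, b) in coeffs.items())

args = ([0.0, 1.0, 2.0], [0.5, 0.3, 0.2], [0.0, 1.0], [0.6, 0.4])  # hypothetical
P = [(0.0, 0.1), (0.4, 1.2), (1.0, 2.0)]          # hypothetical (x, theta) pixels
Q = [(x + 0.5, theta) for x, theta in P]          # P translated by 0.5

dP, dQ = projected_descriptor(P, *args), projected_descriptor(Q, *args)
coeffs = polynomial_coefficients(dP, dQ)

# Evaluate the polynomial on a fixed grid of candidate translations.
candidates = [-2.0 + 0.5 * i for i in range(9)]
best_shift = max(candidates, key=lambda d: score(coeffs, d))

# Brute-force check: rebuild the descriptor of the shifted sketch explicitly.
delta = 0.3
shifted = projected_descriptor([(x + delta, t) for x, t in P], *args)
brute_force = sum(dot(shifted[w][0], dQ[w][0]) + dot(shifted[w][1], dQ[w][1])
                  for w in shifted)
```

Once the coefficients are computed, each additional candidate translation costs only a handful of multiplications, whereas the brute-force route rebuilds the descriptor of the shifted sketch every time.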
References
[1] Local features: State of the art, open problems and performance evaluation. http://www.iis.ee.ic.ac.uk/ComputerVision/DescrWorkshop/.
[2] A. C. Berg, T. L. Berg, and J. Malik. Shape matching and object recognition using low distortion correspondences. In CVPR, 2005.
[3] L. Bo and C. Sminchisescu. Efficient match kernel between sets of features for visual recognition. In NIPS, Dec. 2009.
[4] T. Bui and J. Collomosse. Scalable sketch-based image retrieval using color gradient features. In ICCV, 2015.
[5] A. Bursuc, G. Tolias, and H. Jégou. Kernel local descriptors with implicit rotation matching. In ICMR, 2015.
[6] X. Cao, H. Zhang, S. Liu, X. Guo, and L. Lin. Sym-fish: A symmetry-aware flip invariant sketch histogram shape descriptor. In ICCV. IEEE, 2013.
[7] Y. Cao, C. Wang, L. Zhang, and L. Zhang. Edgel index for large-scale sketch-based image search. In CVPR. IEEE, 2011.
[8] A. Chalechale, G. Naghdy, and A. Mertins. Sketch-based image matching using angular partitioning. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on, 35(1):28–41, 2005.
