Collaborative Quantization for Cross-Modal Similarity Search
Ting Zhang, Jingdong Wang

TL;DR
This paper introduces a novel cross-modal quantization method that jointly learns quantizers for images and texts in a shared space, significantly improving the efficiency and accuracy of cross-modal similarity search.
Contribution
It is among the first to incorporate quantization into cross-modal search by jointly learning modality-specific quantizers and a shared space for improved retrieval performance.
Findings
Achieves state-of-the-art results on benchmark datasets.
Demonstrates superior efficiency over existing methods.
Effectively aligns cross-modal representations for accurate search.
| Task | Method | Wiki 16 bits | Wiki 32 bits | Wiki 64 bits | Wiki 128 bits | FLICKR 16 bits | FLICKR 32 bits | FLICKR 64 bits | FLICKR 128 bits | NUS-WIDE 16 bits | NUS-WIDE 32 bits | NUS-WIDE 64 bits | NUS-WIDE 128 bits |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Img to Txt | CMSSH [1] | 0.2110 | 0.2115 | 0.1932 | 0.1909 | 0.6468 | 0.6616 | 0.6681 | 0.6624 | 0.5243 | 0.5210 | 0.5211 | 0.4813 |
| Img to Txt | CVH [8] | 0.1947 | 0.1798 | 0.1732 | 0.1912 | 0.6450 | 0.6363 | 0.6273 | 0.6204 | 0.5352 | 0.5254 | 0.5011 | 0.4705 |
| Img to Txt | MLBE [37] | 0.3537 | 0.3947 | 0.2599 | 0.2247 | 0.6085 | 0.5866 | 0.5841 | 0.5883 | 0.4472 | 0.4540 | 0.4703 | 0.4026 |
| Img to Txt | QCH [30] | 0.1490 | 0.1726 | 0.1621 | 0.1611 | 0.5722 | 0.5780 | 0.5618 | 0.5567 | 0.5090 | 0.5270 | 0.5208 | 0.5135 |
| Img to Txt | LSSH [38] | 0.2396 | 0.2336 | 0.2405 | 0.2373 | 0.6328 | 0.6403 | 0.6451 | 0.6511 | 0.5368 | 0.5527 | 0.5674 | 0.5723 |
| Img to Txt | CMFH [4] | 0.2548 | 0.2591 | 0.2594 | 0.2651 | 0.5886 | 0.6067 | 0.6343 | 0.6550 | 0.4740 | 0.4821 | 0.5130 | 0.5068 |
| Img to Txt | (CMFH [4]) | (0.2538) | (0.2582) | (0.2619) | (0.2648) | - | - | - | - | - | - | - | - |
| Img to Txt | (CCQ [10]) | (0.2513) | (0.2529) | (0.2587) | - | - | - | - | - | - | - | - | - |
| Img to Txt | CMCQ | 0.2478 | 0.2513 | 0.2567 | 0.2614 | 0.6705 | 0.6716 | 0.6782 | 0.6821 | 0.5637 | 0.5902 | 0.5990 | 0.6096 |
| Txt to Img | CMSSH [1] | 0.2446 | 0.2505 | 0.2387 | 0.2352 | 0.6123 | 0.6400 | 0.6382 | 0.6242 | 0.4177 | 0.4259 | 0.4187 | 0.4203 |
| Txt to Img | CVH [8] | 0.3186 | 0.2354 | 0.2046 | 0.2085 | 0.6595 | 0.6507 | 0.6463 | 0.6580 | 0.5601 | 0.5439 | 0.5160 | 0.4821 |
| Txt to Img | MLBE [37] | 0.3336 | 0.3993 | 0.4897 | 0.2997 | 0.5937 | 0.6182 | 0.6550 | 0.6392 | 0.4352 | 0.4888 | 0.5020 | 0.4425 |
| Txt to Img | QCH [30] | 0.1924 | 0.1561 | 0.1800 | 0.1917 | 0.5752 | 0.6002 | 0.5757 | 0.5723 | 0.5099 | 0.5172 | 0.5092 | 0.5089 |
| Txt to Img | LSSH [38] | 0.5776 | 0.5886 | 0.5998 | 0.6103 | 0.6504 | 0.6726 | 0.6965 | 0.7010 | 0.6357 | 0.6638 | 0.6820 | 0.6926 |
| Txt to Img | CMFH [4] | 0.6153 | 0.6363 | 0.6411 | 0.6504 | 0.5873 | 0.6019 | 0.6477 | 0.6623 | 0.5109 | 0.5643 | 0.5896 | 0.5943 |
| Txt to Img | (CMFH [4]) | (0.6116) | (0.6298) | (0.6398) | (0.6477) | - | - | - | - | - | - | - | - |
| Txt to Img | (CCQ [10]) | (0.6351) | (0.6394) | (0.6405) | - | - | - | - | - | - | - | - | - |
| Txt to Img | CMCQ | 0.6397 | 0.6474 | 0.6546 | 0.6593 | 0.7248 | 0.7335 | 0.7394 | 0.7550 | 0.6898 | 0.7086 | 0.7194 | 0.7254 |

(A) Comparison between ours and SePHkm (the best version of SePH)

| Dataset | Task | Method |
|---|---|---|
| NUS-WIDE | Img to Txt | SePHkm |
| NUS-WIDE | Img to Txt | CMCQ |
| NUS-WIDE | Txt to Img | SePHkm |
| NUS-WIDE | Txt to Img | CMCQ |

(B) Generalization to "newly-coming classes": ours outperforms SePHkm

| Dataset | Task | Method |
|---|---|---|
| NUS-WIDE | Img to Txt | SePHkm |
| NUS-WIDE | Img to Txt | CMCQ |
| NUS-WIDE | Txt to Img | SePHkm |
| NUS-WIDE | Txt to Img | CMCQ |
Ting Zhang (1), Jingdong Wang (2)
(1) University of Science and Technology of China, China; (2) Microsoft Research, China
This work was done when Ting Zhang was an intern at MSR.
Abstract
Cross-modal similarity search is the problem of designing a search system that supports querying across content modalities, e.g., using an image to search for texts or using a text to search for images. This paper presents a compact coding solution for efficient search, with a focus on the quantization approach, which has already shown superior performance over hashing solutions in single-modal similarity search. We propose a cross-modal quantization approach, which is among the early attempts to introduce quantization into cross-modal search. The major contribution lies in jointly learning the quantizers for both modalities by aligning the quantized representations for each pair of image and text belonging to a document. In addition, our approach simultaneously learns the common space for both modalities in which quantization is conducted, to enable efficient and effective search using the Euclidean distance computed in the common space with fast distance table lookup. Experimental results compared with several competitive algorithms over three benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance.
1 Introduction
Similarity search has been a fundamental problem in information retrieval and multimedia search. Classical approaches, however, are designed to address the single-modal search problem [25, 24, 27, 28, 13, 14], where, for instance, a text query is used to search in a text database, or an image query is used to search in an image database. In this paper, we deal with the cross-modal similarity search problem, an important problem that has emerged in multimedia information retrieval, for example, using a text query to retrieve images or using an image query to retrieve texts.
We study compact coding solutions to cross-modal similarity search, focusing in particular on a common real-world scenario: image and text modalities. Compact coding converts the database items into short codes on which similarity search can be conducted efficiently. It has been widely studied in single-modal similarity search, with typical solutions including hashing [3, 15, 23] and quantization [5, 6, 16, 34, 26, 35], but remains relatively unexplored in cross-modal search apart from a few hashing approaches [1, 8, 11, 38]. We are interested in the quantization approach, which represents each point by a short code formed by the index of the nearest center, as quantization has shown more powerful representation ability than hashing in single-modal search.
Rather than performing the quantization directly in the original feature space, we learn a common space for both modalities with the goal that an image and its paired text lie close to each other in the learnt common space. Learning such a common space is important and useful for the subsequent quantization, whose similarity is computed based on the Euclidean distance. A similar observation has also been made in some hashing techniques [17, 18, 38] that apply the sign function on the learnt common space.
In this paper, we propose a novel approach for cross-modal similarity search, called collaborative quantization, that conducts the quantization simultaneously for both modalities in the common space, to which the database items of both modalities are mapped through matrix factorization. The quantization and the common space mapping are jointly optimized for both modalities under the objective that the quantized approximations of the descriptors of an image and a text forming a pair in the search database are well aligned. Our approach is one of the early attempts to introduce quantization into cross-modal similarity search, and it offers superior search performance. Experimental results on several standard datasets show that our approach outperforms existing cross-modal hashing and quantization algorithms.
2 Related work
There are two categories of compact coding approaches for cross-modal similarity search: cross-modal hashing and cross-modal quantization.
Cross-modal hashing often maps multi-modal data into a common Hamming space so that the hash codes of different modalities are directly comparable using the Hamming distance. After mapping, each document may have just one unified hash code, in which all the modalities of the document are mapped, or may have two separate hash codes, each corresponding to a modality. The main research problem in cross-modal hashing, besides hash function design that is also studied in single-modal search, is how to exploit and build the relations between the modalities. In general, the relations of multi-modal data, besides the intra-modality relation in the single modality (image vs. image and text vs. text) and the inter-modality relation across the modalities (image vs. text), also include intra-document (the correspondence of an image and a text forming a document, which is a special kind of inter-modality) and inter-document (document vs. document). A brief categorization is shown in Table 1.
The early approach, data fusion hashing [1], is a pairwise cross-modal similarity sensitive approach, which aligns the similarities (defined as inner product) in the Hamming space across the modalities, with the given inter-modality similar and dissimilar relations using the maximizing similarity-agreement criterion. An alternative formulation using the minimizing similarity-difference criterion is introduced in [32]. Co-regularized hashing [36] uses a smoothly clipped inverted squared deviation function to connect the inter-modality relation with the similarity over the projections that form the hashing codes. Similar regularization techniques are adopted for multi-modal hashing in [12]. In addition to the inter-modality similarities, several other hashing techniques, such as multimodal similarity-preserving hashing [11], sparse hashing approach [31], a probabilistic model for hashing [37], also explore and utilize the intra-modality relation to learn the hash codes for each modality.
Cross-view hashing [8] defines the distance between documents in the Hamming space by considering the hash codes of all the modalities, and aligns it with the given inter-document similarity. Multi-view spectral hashing [7] adopts a similar formulation but with a different optimization algorithm. These methods usually also involve the intra-document relation in an implicit way by considering the multi-modal document as an integrated whole object. Other hashing methods explore the inter-document relation for multi-modal representation, but not for cross-modal similarity search, such as composite hashing [33] and effective multiple feature hashing [19].
The intra-document relation is often used to learn a unified hash code, with a hash function learnt for each modality to map its features into that code. For example, latent semantic sparse hashing [38] applies the sign function on the joint space projected from the latent semantic representation learnt for each modality. Collective matrix factorization hashing [4] finds the common (same) representation for an image-text pair via collective matrix factorization, and obtains the hash codes directly using the sign function on the common representation. Other methods exploring the intra-document relation include semantic topic multimodal hashing [22], semantics-preserving multi-view hashing [9], inter-media hashing [33] and its accelerated version [39], and so on. Meanwhile, several attempts [29, 21] have been made based on neural networks, which can also be combined with our approach to learn the common space.
Recently, a few techniques based on quantization have been developed for cross-modal search. Quantized correlation hashing [30] combines hash function learning with quantization by simultaneously minimizing the inter-modality similarity disagreement and the binary quantization error. Compositional correlation quantization [10] projects the multi-modal data into a common space, and then obtains a unified quantization representation for each document. Our approach, which also explores the intra-document relation, belongs to this cross-modal quantization category and achieves state-of-the-art performance.
3 Formulation
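We study the similarity search problem over a database $\mathcal{Z}$ of documents with two modalities: image and text. Each document is a pair of image and text, $\mathcal{Z} = \{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^{N}$, where $\mathbf{x}_n \in \mathbb{R}^{D_I}$ is a $D_I$-dimensional feature vector describing an image, and $\mathbf{y}_n \in \mathbb{R}^{D_T}$ is a $D_T$-dimensional feature vector describing a text. Splitting the database $\mathcal{Z}$ yields two databases formed by the images and the texts separately, i.e., $\mathcal{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ and $\mathcal{Y} = \{\mathbf{y}_1, \dots, \mathbf{y}_N\}$. Given an image (text) query $\mathbf{x}_q$ ($\mathbf{y}_q$), the goal of cross-modal similarity search is to retrieve the closest match in the text (image) database: $\arg\max_{\mathbf{y} \in \mathcal{Y}} \operatorname{sim}(\mathbf{x}_q, \mathbf{y})$ ($\arg\max_{\mathbf{x} \in \mathcal{X}} \operatorname{sim}(\mathbf{y}_q, \mathbf{x})$).
Directly quantizing the feature vectors $\mathbf{x}$ and $\mathbf{y}$ to $\bar{\mathbf{x}}$ and $\bar{\mathbf{y}}$ would require a further non-trivial scheme to learn the similarity between vectors of different dimensions. Instead, we find a common space for both image and text and jointly quantize the image and text descriptors in that space, so that the Euclidean distance, which is widely used in single-modal similarity search, can also be used for cross-modal similarity evaluation.
Collaborative quantization. Suppose the images and the texts in the $D$-dimensional common space are represented as $\mathbf{X}' = [\mathbf{x}'_1, \dots, \mathbf{x}'_N]$ and $\mathbf{Y}' = [\mathbf{y}'_1, \dots, \mathbf{y}'_N]$. For each modality, we propose to adopt composite quantization [34] to quantize the vectors in the common space. Composite quantization aims to approximate the images $\mathbf{X}'$ as $\mathbf{X}' \approx \mathbf{C}\mathbf{P}$ by minimizing
$$\|\mathbf{X}' - \mathbf{C}\mathbf{P}\|_F^2. \quad (1)$$
Here, $\mathbf{C} = [\mathbf{C}_1, \dots, \mathbf{C}_M]$ corresponds to the $M$ dictionaries, $\mathbf{C}_m = [\mathbf{c}_{m1}, \dots, \mathbf{c}_{mK}]$ is the $m$th dictionary of size $K$, and each column is a dictionary element. $\mathbf{P} = [\mathbf{p}_1, \dots, \mathbf{p}_N]$ with $\mathbf{p}_n = [\mathbf{p}_{n1}^\top, \dots, \mathbf{p}_{nM}^\top]^\top$, and $\mathbf{p}_{nm}$ is a $K$-dimensional binary ($0$, $1$) vector with only one $1$-valued entry, indicating that the corresponding element in the $m$th dictionary is selected to compose $\mathbf{x}'_n$. The texts $\mathbf{Y}'$ in the common space are approximated as $\mathbf{Y}' \approx \mathbf{D}\mathbf{Q}$, and the meaning of the symbols is analogous to that for the images.
Besides the quantization quality, we exploit the intra-document correlation between images and texts for the quantization: the image and the text forming a document should be close after quantization, which is the bridge connecting images and texts for cross-modal search. We adopt the following simple formulation that minimizes the distance between the quantized image and the corresponding quantized text,
$$\|\mathbf{C}\mathbf{P} - \mathbf{D}\mathbf{Q}\|_F^2. \quad (2)$$
The overall collaborative quantization formulation is given as follows,
$$\mathcal{Q}(\mathbf{C}, \mathbf{P}; \mathbf{D}, \mathbf{Q}) = \|\mathbf{X}' - \mathbf{C}\mathbf{P}\|_F^2 + \|\mathbf{Y}' - \mathbf{D}\mathbf{Q}\|_F^2 + \gamma \|\mathbf{C}\mathbf{P} - \mathbf{D}\mathbf{Q}\|_F^2, \quad (3)$$
where $\gamma$ is a trade-off variable to balance the quantization quality and the correlation degree.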
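The collaborative quantization objective can be illustrated numerically. The sketch below is a minimal NumPy illustration, not the authors' implementation: it assumes the objective is the sum of the two quantization errors plus a trade-off weight (called `gamma` here) times the alignment error, stores each one-hot selector as an integer index, and uses hypothetical function names.

```python
import numpy as np

def reconstruct(dicts, codes):
    """Composite-quantization reconstruction.

    dicts: list of M arrays of shape (D, K), one per dictionary.
    codes: int array of shape (N, M); codes[n, m] is the index of the
           element selected from dictionary m for item n.
    Returns a (D, N) array whose column n is the sum of the selected elements.
    """
    return sum(Cm[:, codes[:, m]] for m, Cm in enumerate(dicts))

def collaborative_objective(Xp, Yp, C, P_idx, D, Q_idx, gamma):
    """Image error + text error + gamma * alignment error (squared Frobenius norms)."""
    img_approx = reconstruct(C, P_idx)
    txt_approx = reconstruct(D, Q_idx)
    return (np.sum((Xp - img_approx) ** 2)
            + np.sum((Yp - txt_approx) ** 2)
            + gamma * np.sum((img_approx - txt_approx) ** 2))
```

Storing indices instead of one-hot vectors is only a convenience: selecting column `codes[n, m]` of dictionary `m` is the same operation as multiplying that dictionary by the corresponding one-hot selector.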
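Common space mapping. The common space mapping problem aims to map the data in different modalities into the same space so that the representations across modalities are comparable. In our problem, we want to map the $D_I$-dimensional image data $\mathbf{X}$ and the $D_T$-dimensional text data $\mathbf{Y}$ to the same $D$-dimensional data: $\mathbf{X}'$ and $\mathbf{Y}'$.
We choose the matrix-decomposition solution as in [38]: the image data $\mathbf{X}$ is approximated using sparse coding as a product of two matrices $\mathbf{B}\mathbf{S}$, and the sparse code $\mathbf{S}$ is shown to be a good representation of the raw feature $\mathbf{X}$; the text data $\mathbf{Y}$ is also decomposed into two matrices, $\mathbf{U}$ and $\mathbf{Y}'$, where $\mathbf{Y}'$ is the low-dimensional representation. In addition, a transformation matrix $\mathbf{R}$ is introduced to align the image sparse code $\mathbf{S}$ with the text code $\mathbf{Y}'$ by minimizing $\|\mathbf{Y}' - \mathbf{R}\mathbf{S}\|_F^2$, and the image in the common space is represented as $\mathbf{X}' = \mathbf{R}\mathbf{S}$. The objective function for common space mapping is written as follows,
$$\mathcal{M}(\mathbf{B}, \mathbf{S}; \mathbf{U}, \mathbf{Y}'; \mathbf{R}) = \|\mathbf{X} - \mathbf{B}\mathbf{S}\|_F^2 + \rho |\mathbf{S}|_{11} + \eta \|\mathbf{Y} - \mathbf{U}\mathbf{Y}'\|_F^2 + \lambda \|\mathbf{Y}' - \mathbf{R}\mathbf{S}\|_F^2, \quad \text{s.t. } \|\mathbf{B}_{\cdot i}\|_2^2 \le 1,\ \|\mathbf{U}_{\cdot i}\|_2^2 \le 1,\ \|\mathbf{R}_{\cdot i}\|_2^2 \le 1. \quad (4)$$
Here $|\mathbf{S}|_{11} = \sum_{i}\sum_{j} |S_{ij}|$ is the sparse term, and $\rho$ determines the sparsity degree; $\eta$ is used to balance the scales of the image and text representations; $\lambda$ is a trade-off parameter controlling the balance between the approximation degree in each modality and the alignment degree for the pair of image and text.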
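Overall objective function. In summary, the overall formulation of the proposed cross-modal quantization is,
$$\min_{\theta_q, \theta_m} \mathcal{F}(\theta_q, \theta_m) = \mathcal{Q}(\mathbf{C}, \mathbf{P}; \mathbf{D}, \mathbf{Q}) + \mathcal{M}(\mathbf{B}, \mathbf{S}; \mathbf{U}, \mathbf{Y}'; \mathbf{R}) \quad (5)$$
$$\text{s.t. } \sum_{i=1}^{M} \sum_{j=1, j \ne i}^{M} \mathbf{p}_{ni}^\top \mathbf{C}_i^\top \mathbf{C}_j \mathbf{p}_{nj} = \epsilon_1, \quad (6)$$
$$\sum_{i=1}^{M} \sum_{j=1, j \ne i}^{M} \mathbf{q}_{ni}^\top \mathbf{D}_i^\top \mathbf{D}_j \mathbf{q}_{nj} = \epsilon_2, \quad (7)$$
together with the norm constraints on the columns of $\mathbf{B}$, $\mathbf{U}$ and $\mathbf{R}$, where $\theta_q$ and $\theta_m$ represent the parameters in quantization and mapping, i.e., $(\mathbf{C}, \mathbf{P}; \mathbf{D}, \mathbf{Q})$ and $(\mathbf{B}, \mathbf{S}; \mathbf{U}, \mathbf{Y}'; \mathbf{R})$ respectively. The constraints in Equation 6 and Equation 7 are introduced for fast distance computation as in composite quantization [34], and more details about the search process are presented in Section 4.3.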
4 Optimization
We optimize the overall problem by alternately solving two sub-problems: common space mapping with the quantization parameters fixed, $\min_{\theta_m} \mathcal{F}(\theta_m \mid \theta_q)$, and collaborative quantization with the mapping parameters fixed, $\min_{\theta_q} \mathcal{F}(\theta_q \mid \theta_m)$. Each of the two sub-problems is in turn solved by a standard alternating algorithm.
4.1 Common space mapping
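The objective function of the common space mapping with the quantization parameters fixed is,
$$\min_{\theta_m} \|\mathbf{X} - \mathbf{B}\mathbf{S}\|_F^2 + \rho |\mathbf{S}|_{11} + \eta \|\mathbf{Y} - \mathbf{U}\mathbf{Y}'\|_F^2 + \lambda \|\mathbf{Y}' - \mathbf{R}\mathbf{S}\|_F^2 + \|\mathbf{R}\mathbf{S} - \mathbf{C}\mathbf{P}\|_F^2 + \|\mathbf{Y}' - \mathbf{D}\mathbf{Q}\|_F^2, \quad \text{s.t. } \|\mathbf{B}_{\cdot i}\|_2^2 \le 1,\ \|\mathbf{U}_{\cdot i}\|_2^2 \le 1,\ \|\mathbf{R}_{\cdot i}\|_2^2 \le 1.$$
The iteration details are given below.
Update $\mathbf{Y}'$. The objective function with respect to $\mathbf{Y}'$ is an unconstrained quadratic optimization problem, and is solved by the following closed-form solution,
$$\mathbf{Y}' = \left(\eta \mathbf{U}^\top \mathbf{U} + (\lambda + 1)\mathbf{I}\right)^{-1} \left(\eta \mathbf{U}^\top \mathbf{Y} + \lambda \mathbf{R}\mathbf{S} + \mathbf{D}\mathbf{Q}\right),$$
where $\mathbf{I}$ is the identity matrix.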
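The closed-form update of the text representation can be checked numerically. The sketch below assumes that the terms involving the text representation are a weighted reconstruction term, an alignment term and a quantization term, as written in the docstring; the function names, the weights `eta` and `lam`, and the array shapes are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def update_text_codes(Y, U, RS, DQ, eta, lam):
    """Minimize over Yp:
        eta * ||Y - U Yp||_F^2 + lam * ||Yp - RS||_F^2 + ||Yp - DQ||_F^2
    Y: (D_T, N), U: (D_T, d), RS and DQ: (d, N). Returns Yp of shape (d, N).
    """
    d = U.shape[1]
    A = eta * U.T @ U + (lam + 1.0) * np.eye(d)
    rhs = eta * U.T @ Y + lam * RS + DQ
    # Solve the linear system instead of forming an explicit inverse.
    return np.linalg.solve(A, rhs)

def objective_gradient(Yp, Y, U, RS, DQ, eta, lam):
    """Gradient of the objective above with respect to Yp."""
    return 2.0 * (-eta * U.T @ (Y - U @ Yp) + lam * (Yp - RS) + (Yp - DQ))
```

At the returned solution the gradient should vanish up to floating-point error, which is a quick way to confirm that the normal equations were set up consistently.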
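Update $\mathbf{S}$. The objective function with respect to $\mathbf{S}$ can be transformed to,
$$\min_{\mathbf{S}} \left\| \begin{bmatrix} \mathbf{X} \\ \sqrt{\lambda}\,\mathbf{Y}' \\ \mathbf{C}\mathbf{P} \end{bmatrix} - \begin{bmatrix} \mathbf{B} \\ \sqrt{\lambda}\,\mathbf{R} \\ \mathbf{R} \end{bmatrix} \mathbf{S} \right\|_F^2 + \rho |\mathbf{S}|_{11},$$
which is solved using the sparse learning with efficient projections package (http://parnec.nuaa.edu.cn/jliu/largeScaleSparseLearning.htm).
Update $\mathbf{U}$, $\mathbf{B}$, $\mathbf{R}$. The algorithms for updating $\mathbf{U}$, $\mathbf{B}$ and $\mathbf{R}$ are the same, as we can see from the following formulas,
$$\min_{\mathbf{U}} \|\mathbf{Y} - \mathbf{U}\mathbf{Y}'\|_F^2, \quad \text{s.t. } \|\mathbf{U}_{\cdot i}\|_2^2 \le 1,$$
$$\min_{\mathbf{B}} \|\mathbf{X} - \mathbf{B}\mathbf{S}\|_F^2, \quad \text{s.t. } \|\mathbf{B}_{\cdot i}\|_2^2 \le 1,$$
$$\min_{\mathbf{R}} \left\| [\sqrt{\lambda}\,\mathbf{Y}',\ \mathbf{C}\mathbf{P}] - \mathbf{R}\,[\sqrt{\lambda}\,\mathbf{S},\ \mathbf{S}] \right\|_F^2, \quad \text{s.t. } \|\mathbf{R}_{\cdot i}\|_2^2 \le 1.$$
All three of the above learning problems are quadratically constrained least squares problems, which have been well studied in numerical optimization and can be readily solved using the primal-dual conjugate gradient method.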
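The reduction to a single sparse-coding problem can be sanity-checked numerically. The sketch below assumes that the terms involving the sparse code are an image reconstruction term, a weighted alignment term and a quantization term; stacking the targets and the design matrices vertically should then reproduce the same value. The helper name and the weight `lam` are hypothetical, and the L1 solver itself is not shown.

```python
import numpy as np

def stack_for_sparse_coding(X, Yp, CP, B, R, lam):
    """Stack the three squared-error terms into one least-squares problem.

    X: (D_I, N), B: (D_I, k); Yp and CP: (d, N), R: (d, k).
    Returns (target, design) so that ||target - design @ S||_F^2 equals
    ||X - B S||^2 + lam * ||Yp - R S||^2 + ||CP - R S||^2 for any S of shape (k, N).
    """
    root = np.sqrt(lam)
    target = np.vstack([X, root * Yp, CP])
    design = np.vstack([B, root * R, R])
    return target, design
```

With the stacked pair in hand, any L1-regularized least-squares solver can be applied to the sparse code in one call, which is presumably why the problem is rewritten in this form.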
4.2 Collaborative quantization
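The second sub-problem is transformed to an unconstrained formulation by adding the equality constraints as a penalty regularization with a penalty parameter $\mu$,
$$\Psi = \mathcal{Q}(\mathbf{C}, \mathbf{P}; \mathbf{D}, \mathbf{Q}) + \mu \sum_{n=1}^{N} \Big( \sum_{i \ne j} \mathbf{p}_{ni}^\top \mathbf{C}_i^\top \mathbf{C}_j \mathbf{p}_{nj} - \epsilon_1 \Big)^2 + \mu \sum_{n=1}^{N} \Big( \sum_{i \ne j} \mathbf{q}_{ni}^\top \mathbf{D}_i^\top \mathbf{D}_j \mathbf{q}_{nj} - \epsilon_2 \Big)^2,$$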
which is solved by alternatively updating each variable with others fixed.
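Update $\mathbf{C}$ ($\mathbf{D}$). The optimization procedures for $\mathbf{C}$ and $\mathbf{D}$ are essentially the same, so we only show how to optimize $\mathbf{C}$. We adopt the L-BFGS (http://www.ece.northwestern.edu/~nocedal/lbfgs.html) algorithm, one of the most frequently used quasi-Newton methods, to solve the unconstrained non-linear problem with respect to $\mathbf{C}$. The derivative of the objective function is $[\frac{\partial \Psi}{\partial \mathbf{C}_1}, \dots, \frac{\partial \Psi}{\partial \mathbf{C}_M}]$,
$$\frac{\partial \Psi}{\partial \mathbf{C}_m} = 2\big[(\mathbf{C}\mathbf{P} - \mathbf{X}') + \gamma(\mathbf{C}\mathbf{P} - \mathbf{D}\mathbf{Q})\big]\mathbf{P}_m^\top + 4\mu \sum_{n=1}^{N} \Big( \sum_{i \ne j} \mathbf{p}_{ni}^\top \mathbf{C}_i^\top \mathbf{C}_j \mathbf{p}_{nj} - \epsilon_1 \Big) \Big( \sum_{l \ne m} \mathbf{C}_l \mathbf{p}_{nl} \Big) \mathbf{p}_{nm}^\top,$$
where $\mathbf{P}_m = [\mathbf{p}_{1m}, \dots, \mathbf{p}_{Nm}]$.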
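Update $\epsilon_1$, $\epsilon_2$. With the other variables fixed, it is easy to get the optimal solution,
$$\epsilon_1 = \frac{1}{N} \sum_{n=1}^{N} \sum_{i \ne j} \mathbf{p}_{ni}^\top \mathbf{C}_i^\top \mathbf{C}_j \mathbf{p}_{nj}, \qquad \epsilon_2 = \frac{1}{N} \sum_{n=1}^{N} \sum_{i \ne j} \mathbf{q}_{ni}^\top \mathbf{D}_i^\top \mathbf{D}_j \mathbf{q}_{nj}.$$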
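Update $\mathbf{P}$ ($\mathbf{Q}$). Given the other variables fixed, the binary vectors $\{\mathbf{p}_n\}_{n=1}^{N}$ are independent of each other, and hence the optimization problem can be decomposed into $N$ sub-problems,
$$\min_{\mathbf{p}_n} \|\mathbf{x}'_n - \mathbf{C}\mathbf{p}_n\|_2^2 + \gamma \|\mathbf{C}\mathbf{p}_n - \mathbf{D}\mathbf{q}_n\|_2^2 + \mu \Big( \sum_{i \ne j} \mathbf{p}_{ni}^\top \mathbf{C}_i^\top \mathbf{C}_j \mathbf{p}_{nj} - \epsilon_1 \Big)^2.$$
This is a mixed-binary-integer problem, generally considered NP-hard. We therefore solve it approximately by greedily updating the $M$ indicator vectors $\{\mathbf{p}_{nm}\}_{m=1}^{M}$ in cycle: fixing $\{\mathbf{p}_{nm'}\}_{m' \ne m}$, $\mathbf{p}_{nm}$ is updated by exhaustively checking all the elements in $\mathbf{C}_m$, finding the element such that the objective function is minimized, and setting the corresponding entry of $\mathbf{p}_{nm}$ to 1 and all the others to 0. A similar optimization procedure is adopted to update $\mathbf{Q}$.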
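The cyclic greedy update of the indicator vectors can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the per-item cost is taken to be the quantization error, plus a weight `gamma` times the distance to the paired text approximation (`target`), plus a penalty `mu` on the deviation of the inter-dictionary inner products from a constant `eps`; one-hot vectors are stored as integer indices, and all names are hypothetical.

```python
import numpy as np

def point_cost(x, target, C, codes, gamma, mu, eps):
    """Cost of one item for a given choice of one element per dictionary."""
    picked = [C[m][:, k] for m, k in enumerate(codes)]
    total = np.sum(picked, axis=0)                       # composed approximation
    # sum over i != j of c_i^T c_j, via ||sum c||^2 - sum ||c||^2
    cross = total @ total - sum(c @ c for c in picked)
    return (np.sum((x - total) ** 2)
            + gamma * np.sum((total - target) ** 2)
            + mu * (cross - eps) ** 2)

def greedy_update_codes(x, target, C, codes, gamma, mu, eps, n_sweeps=3):
    """Cycle over dictionaries; for each, try every element with the others fixed."""
    codes = list(codes)
    for _ in range(n_sweeps):
        for m in range(len(C)):
            costs = [point_cost(x, target, C, codes[:m] + [k] + codes[m + 1:],
                                gamma, mu, eps)
                     for k in range(C[m].shape[1])]
            codes[m] = int(np.argmin(costs))
    return codes
```

Because the current element is always among the candidates, each step can only keep or lower the cost, so the procedure is monotone even though it offers no guarantee of reaching the global optimum.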
4.3 Search process
In cross-modal search, the given query can be either an image or a text, which require different querying processes.
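Image query. If the query is an image, $\mathbf{x}_q$, we first obtain its representation in the common space, $\mathbf{x}'_q = \mathbf{R}\mathbf{s}_q$, where
$$\mathbf{s}_q = \arg\min_{\mathbf{s}} \|\mathbf{x}_q - \mathbf{B}\mathbf{s}\|_2^2 + \rho |\mathbf{s}|_1.$$
The approximated distance between the image query $\mathbf{x}_q$ and the database text $\mathbf{y}_n$ (represented as $\mathbf{D}\mathbf{q}_n = \sum_{m=1}^{M} \mathbf{D}_m \mathbf{q}_{nm}$) is,
$$\|\mathbf{x}'_q - \mathbf{D}\mathbf{q}_n\|_2^2 = \sum_{m=1}^{M} \|\mathbf{x}'_q - \mathbf{D}_m \mathbf{q}_{nm}\|_2^2 - (M - 1)\|\mathbf{x}'_q\|_2^2 + \sum_{i \ne j} \mathbf{q}_{ni}^\top \mathbf{D}_i^\top \mathbf{D}_j \mathbf{q}_{nj}.$$
The last term is constant for all the texts due to the equality constraint in Equation 7, and the second term depends only on the query. Hence, given $\mathbf{x}'_q$, it is enough to compute the first term to search for the nearest neighbors. This term can be computed efficiently, in $O(M)$ per database item, by looking up a precomputed distance table storing the distances $\{\|\mathbf{x}'_q - \mathbf{d}_{mk}\|_2^2\}$ between the query and every dictionary element.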
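The table-lookup distance computation can be sketched as follows. The code is an illustrative NumPy sketch with hypothetical names: it builds an M-by-K table of squared distances between the mapped query and every dictionary element, then scores each database item with M lookups.

```python
import numpy as np

def build_distance_table(xq, D):
    """D: list of M arrays of shape (dim, K). Returns an (M, K) table whose
    entry (m, k) is the squared distance from xq to element k of dictionary m."""
    return np.stack([np.sum((xq[:, None] - Dm) ** 2, axis=0) for Dm in D])

def lookup_scores(table, codes):
    """codes: (N, M) int array of selected element indices.
    Returns, for each item, the sum over m of table[m, codes[n, m]]."""
    M = table.shape[0]
    return table[np.arange(M), codes].sum(axis=1)
```

The lookup score differs from the exact squared distance to the composed approximation by a query-only term and by the inter-dictionary inner-product term. Ranking by the lookup score therefore matches ranking by the exact distance only when that inner-product term is the same for every database item, which is what the equality constraint is meant to enforce.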
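Text query. When the query comes as a text, $\mathbf{y}_q$, the representation $\mathbf{y}'_q$ is obtained by solving,
$$\mathbf{y}'_q = \arg\min_{\mathbf{y}'} \|\mathbf{y}_q - \mathbf{U}\mathbf{y}'\|_2^2.$$
Using $\mathbf{y}'_q$ to search in the image database is similar to the image query search process.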
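Mapping a text query into the common space can be sketched as an ordinary least-squares fit. The sketch assumes the query representation is the least-squares solution against the learnt text basis (called `U` here); the function name is hypothetical.

```python
import numpy as np

def map_text_query(yq, U):
    """Return the vector y' minimizing ||yq - U @ y'||_2^2.

    yq: (D_T,) raw text feature; U: (D_T, d) text basis.
    """
    y_common, *_ = np.linalg.lstsq(U, yq, rcond=None)
    return y_common
```

The resulting vector can then be scored against the quantized image database with the same distance-table lookup used for image queries.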
5 Discussions
Relation to compositional correlation quantization. The proposed approach is close to compositional correlation quantization [10], which is also a quantization-based method for cross-modal search. Our approach differs from it in two ways: (1) we use a different mapping function to project the data into the common space; (2) we learn separate quantized centers for a pair using two dictionaries, instead of the unified quantized centers in compositional correlation quantization [10], which imposes a harder alignment using one dictionary. Hence, during the quantization stage, our approach can potentially obtain a smaller quantization error, as the quantized center is more flexible, and thus produce better search performance. The empirical comparison illustrating the effect of the dictionary is shown in Figure 2.
Relation to latent semantic sparse hashing. In our formulation, the common space is learnt in a manner similar to latent semantic sparse hashing [38]. After the common space mapping, latent semantic sparse hashing applies a simple sign function directly on the common space, which can result in large information loss and hence weaken the search performance. Our approach instead adopts the quantization technique, which approximates distances more accurately than hashing, and produces better cross-modal search quality than latent semantic sparse hashing, as verified in our experiments shown in Table 2 and Figure 3.
6 Experiments
6.1 Setup
Datasets. We evaluate our method on three benchmark datasets. The first dataset, Wiki (http://www.svcl.ucsd.edu/projects/crossmodal/), consists of 2,866 images and 2,866 texts describing the images in short paragraphs (at least 70 words), with images represented as 128-dimensional SIFT features and texts expressed as 10-dimensional topic vectors. This dataset is divided into 2,173 image-text pairs and 693 queries, and each pair is labeled with one of 10 semantic classes. The second dataset, FLICKR25K (http://www.cs.toronto.edu/~nitish/multimodal/index.html), is composed of 25,000 images along with the user-assigned tags. The average number of tags per image is 5.15 [21]. Each image-text pair is assigned multiple labels from a total of 38 classes. As in [21], the images are represented by 3857-dimensional features and the texts are 2000-dimensional vectors indicating the occurrence of the tags. We randomly sample a portion of the pairs as the test set and use the remaining pairs as the training set. The third dataset is NUS-WIDE (http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm) [2], containing 269,648 images with associated tags (6 on average); each pair is annotated with multiple labels among 81 concepts. As done in previous work [4, 36, 38], we select the 10 most popular concepts, resulting in 186,577 data pairs. The images are represented by 500-dimensional bag-of-words features based on SIFT descriptors, and the texts are 1000-dimensional vectors of the most frequent tags. Following [38], we use 4000 (about 2%) randomly sampled pairs as the query set and the rest as the training set.
Evaluation. We report the results of two cross-modal search tasks: the image (as the query) to text (as the database) task and the text to image task. The search quality is evaluated with two measures: MAP@$T$ and precision@$T$. MAP@$T$ is defined as the mean of the average precisions of all the queries, and the average precision of a query $q$ is computed as $AP(q) = \frac{\sum_{t=1}^{T} P_q(t)\,\delta(t)}{\sum_{t=1}^{T} \delta(t)}$, where $T$ is the number of retrieved items, $P_q(t)$ is the precision at position $t$ for query $q$, and $\delta(t) = 1$ if the $t$th retrieved item has the same label as query $q$ or shares at least one label with it, otherwise $\delta(t) = 0$. Following [38, 4, 10], we report MAP@$T$ at two values of $T$. We also plot the precision@$T$ curve, which is obtained by computing the precision while varying the number of retrieved items.
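The MAP measure described above can be sketched in a few lines of plain Python. The sketch assumes that average precision is normalized by the number of relevant items among the top retrieved items, and that relevance flags (1 if the retrieved item shares a label with the query, else 0) have already been computed; the function names are hypothetical.

```python
def average_precision(relevant, T):
    """relevant: list of 0/1 flags for the ranked retrieval list of one query.
    Averages the precision at each relevant position within the top T."""
    hits = 0
    score = 0.0
    for t, rel in enumerate(relevant[:T], start=1):
        if rel:
            hits += 1
            score += hits / t          # precision at position t
    return score / hits if hits else 0.0

def mean_average_precision(all_relevant, T):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r, T) for r in all_relevant) / len(all_relevant)
```

For a ranked list with flags `[1, 0, 1, 0]` and `T = 4`, the precisions at the two relevant positions are 1 and 2/3, so the average precision is 5/6.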
Compared methods. We compare our approach, Cross-Modal Collaborative Quantization (CMCQ), with three baseline methods that only use the intra-document relation: Latent Semantic Sparse Hashing (LSSH) [38], Collective Matrix Factorization Hashing (CMFH) [4], and Compositional Correlation Quantization (CCQ) [10]. The code of LSSH was generously provided by its authors, and we carefully implemented CMFH ourselves. The performance of CCQ (without public code) is presented partially, using the results reported in its paper. In addition, we report the state-of-the-art algorithms whose code is publicly available: (1) Cross-Modal Similarity Sensitive Hashing (CMSSH) [1], (2) Cross-View Hashing (CVH) [8], (3) Multimodal Latent Binary Embedding (MLBE) [37], (4) Quantized Correlation Hashing (QCH) [30]. The parameters of the above methods are set according to the corresponding papers.
Implementation details. The data for both modalities are mean-centered and then normalized to have unit Euclidean length. We use principal component analysis to project the image features into a lower-dimensional (set to 64) space, and the number of bases in sparse coding is set to 512. The latent dimension of matrix factorization for text data is set equal to the number of code bits, e.g., 16, 32, etc. The mapping parameters (denoted as $\theta_m$) are initialized by solving a relatively easy problem (using an algorithm similar to the one presented in Section 4.1). Then the quantization parameters (denoted as $\theta_q$) are initialized by conducting composite quantization [34] in the common space.
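The preprocessing step can be sketched as follows. This is a minimal illustration with a hypothetical function name, assuming features are stored one item per row; the small floor on the norm is an added safeguard against division by zero, not something stated in the text.

```python
import numpy as np

def center_and_normalize(F):
    """F: (N, dim) feature matrix, one row per item.
    Subtract the per-dimension mean, then scale each row to unit Euclidean length."""
    F = F - F.mean(axis=0, keepdims=True)
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    return F / np.maximum(norms, 1e-12)
```

Note that after the row-wise normalization the data is no longer exactly zero-mean; the order of the two steps follows the description above.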
There are five parameters balancing different trade-offs in our algorithm: the sparsity degree $\rho$, the scale-balance parameter $\eta$, the alignment degree in the common space $\lambda$, the correlation degree of the quantization $\gamma$, and the penalty parameter $\mu$. We simply fix $\mu$ in our experiments, as this already gives satisfactory results. The other four parameters are selected through validation (by varying one parameter at a time while keeping the others fixed) so that the MAP value, when using the validation set (a subset of the training data) as the queries to search in the remaining training data, is the best. The sensitivity analysis of these parameters is presented in Section 6.3.
6.2 Results
Results on Wiki. The comparison in terms of MAP@$T$ and the precision curve is reported in Table 2 and the first row of Figure 3. Our approach, CMCQ, achieves better performance than the other methods on the text to image task. On the image to text task, Table 2 shows that the best performance is achieved by MLBE with 16 bits and 32 bits, and by CMFH with 64 bits and 128 bits. However, the performance of MLBE decreases as the code length gets longer. Our approach, on the other hand, is able to utilize the additional bits to enhance the search quality. In comparison with CMFH, our approach obtains similar results.
Results on FLICKR25K. The performance on the FLICKR25K dataset is shown in Table 2 and the second row of Figure 3. The gain obtained by our approach is significant on both cross-modal search tasks. Moreover, Table 2 shows that our approach with the smallest code length performs better than the other methods with the largest code length. For example, on the text to image task, the MAP of our approach, CMCQ with 16 bits, is 0.7248, about 0.024 larger than 0.7010, the best MAP obtained by the other baseline methods with 128 bits. This indicates that when dealing with a high-dimensional dataset, such as FLICKR with 3857-dimensional image features and 2000-dimensional text features, our method keeps much more information than the hashing-based cross-modal techniques, and hence produces better search quality.
Results on NUS-WIDE. Table 2 and the third row of Figure 3 show the performance of all the methods on the largest of the three datasets, NUS-WIDE. The proposed approach again obtains the best performance. In addition, the figure shows that in most cases the precision of our approach barely drops as the number of retrieved items $T$ increases, for instance on the text to image task, which suggests that our method consistently keeps a large portion of the relevant items retrieved as the number of retrieved items grows.
6.3 Empirical analysis
Comparison with semantics-preserving hashing. Another challenging competitor for our approach is the recent semantics-preserving hashing (SePH) [9]. The comparison is shown in Table 3 (A). SePH outperforms our approach because it exploits the document-label information, which our method does not use for two reasons: (1) the image-text correspondence information comes naturally and easily, while the label information is expensive to obtain; (2) exploiting label information may tend to overfit the data and not generalize well to newly-coming classes. To show this, we conducted an experiment: we split the NUS-WIDE training set into two parts, one with five concepts for training, and the other with the remaining five concepts for the search database, whose codes are extracted using the model trained on the first part. Our results, shown in Table 3 (B), are better than those of SePH, indicating that our method generalizes well to newly-coming classes.
The effect of intra-document correlation. The intra-document correlation is imposed in our formulation over two spaces (the quantized space and the common space) by two regularization terms controlled respectively by the parameters $\gamma$ and $\lambda$. It is possible to add just one such term and set the other to $0$. Specifically, if $\gamma = 0$, our approach degenerates to conducting composite quantization [34] separately on each modality, and if $\lambda = 0$, the proposed approach lacks the explicit connection in the common space. In either case, the bridge that links the pair of image and text is undermined, resulting in reduced cross-modal search quality. The experimental results shown in Figure 1 validate this point: the performance of our approach when considering both of the intra-document correlation terms is much better.
The effect of dictionary. One possible way for our approach to better capture the intra-document correlation is to use the same dictionary to quantize both modalities, i.e., adding the constraint $\mathbf{C} = \mathbf{D}$ to Formulation 3, which is similar to [10]. This might introduce a closer connection between a pair of image and text, and hence improve the search quality. However, our experiments shown in Figure 2 suggest that this is not the case. The reason might be that using one dictionary for two modalities reduces the approximation ability of quantization compared with using two dictionaries.
Parameter sensitivity analysis. We also conduct a parameter sensitivity analysis to show that our approach is robust to changes in the parameters. The experiments are conducted on FLICKR and NUS-WIDE using a validation set, formed by randomly sampling a subset of the training dataset. The size of the validation set is 1000 and 2000 for FLICKR and NUS-WIDE respectively. To evaluate the sensitivity of a parameter, we vary it from 0.001 to 10 (1 for $\rho$) while keeping the others fixed.
The empirical results on the two search tasks (task1: image to text and task2: text to image) are presented in Figure 4. The figure shows that our approach achieves superior performance under a wide range of parameter values. We notice that when the parameter $\rho$ gets close to 1, the performance drops suddenly. The reason might be that with a larger sparsity degree $\rho$, the learnt image representation in the common space carries little information, since the learnt $\mathbf{S}$ is a very sparse matrix.
7 Conclusion
In this paper, we present a quantization-based compact coding approach, collaborative quantization, for cross-modal similarity search. The superiority of the proposed approach stems from the fact that it learns the quantizers for both modalities jointly by aligning the quantized approximations for each pair of image and text in the common space, which is learnt simultaneously with the quantization. Empirical results on three multi-modal datasets indicate that the proposed approach outperforms existing methods.
References
- [1] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios. Data fusion through cross-modality metric learning using similarity-sensitive hashing. In CVPR, pages 3594–3601. IEEE Computer Society, 2010.
- [2] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y.-T. Zheng. NUS-WIDE: A real-world web image database from National University of Singapore. In Proc. of ACM Conf. on Image and Video Retrieval (CIVR'09), Santorini, Greece, July 8-10, 2009.
- [3] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the 20th ACM Symposium on Computational Geometry, Brooklyn, New York, USA, June 8-11, 2004, pages 253–262, 2004.
- [4] G. Ding, Y. Guo, and J. Zhou. Collective matrix factorization hashing for multimodal data. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 2083–2090, 2014.
- [5] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Anal. Mach. Intell., 35(12):2916–2929, 2013.
- [6] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell., 33(1):117–128, 2011.
- [7] S. Kim, Y. Kang, and S. Choi. Sequential spectral learning to hash with multiple representations. In Computer Vision - ECCV 2012 - 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V, pages 538–551, 2012.
- [8] S. Kumar and R. Udupa. Learning hash functions for cross-view similarity search. In T. Walsh, editor, IJCAI, pages 1360–1365. IJCAI/AAAI, 2011.
