Learning cross space mapping via DNN using large scale click-through logs
Wei Yu, Kuiyuan Yang, Yalong Bai, Hongxun Yao, Yong Rui

TL;DR
This paper introduces a deep neural network model called cross space mapping (CSM) that maps images and queries into a common space for improved image-query similarity measurement, trained on large-scale click-through logs.
Contribution
The paper proposes a unified DNN model for image-query similarity that jointly models images and queries in a shared space, trained on extensive click-through data.
Findings
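The CSM model achieves a higher DCG25 than the Concept Classification and Passive-Aggressive baselines on an image retrieval task with 1,000 test queries.
Training on 23 million clicked image-query pairs shows no observable overfitting, even without dropout.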
Qualitative and quantitative evaluations confirm the effectiveness of the approach.
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
Learning Cross Space Mapping via DNN using Large Scale Clickthrough Data
Wei Yu*, Kuiyuan Yang, Yalong Bai, Hongxun Yao, and Yong Rui
W. Yu, Y. Bai, and H. Yao are with Harbin Institute of Technology, Harbin, Heilongjiang 150090. K. Yang and Y. Rui are with Microsoft Research, Beijing 100080.
Abstract
The semantic gap between low-level image pixels and high-level semantics has been progressively bridged by the continuous development of Deep Neural Networks (DNN), which have achieved great success in almost all image classification tasks. To extend the power of DNN to the image retrieval task, we propose a unified DNN model for image-query similarity calculation that simultaneously models image and query in one network. The unified DNN is named the Cross Space Mapping (CSM) model and contains two main parts, i.e., a convolutional part and a query-embedding part. The image and the query are mapped to a common vector space via these two parts respectively, and image-query similarity is naturally defined as the inner product of their mappings in that space. To ensure good generalization ability of the DNN, we learn its weights on a large-scale clickthrough dataset that consists of 23 million clicked image-query pairs between 1 million images and 11.7 million queries. Both the qualitative and the quantitative results on an image retrieval evaluation task with 1000 queries demonstrate the superiority of the proposed method.
Index Terms:
Image retrieval, Cross space mapping, Deep neural network.
I Introduction
With the popularization of digital cameras and storage devices, millions of images are taken every day, and billions of images are hosted on photo-sharing websites and image search engines. A natural problem with such gigantic image collections is how to retrieve the relevant images for everyday users, which is well known as the image retrieval problem. Although image retrieval shares its user-interaction mode with document retrieval (users provide a few keywords as a query, and the machine returns a list of relevant documents), image retrieval is more challenging because the machine cannot directly use string matching to check whether a textual query matches the candidate images. Current image search engines mainly rely on the surrounding text of an image to represent the textual information conveyed in the image, and thereby convert image retrieval into document retrieval. However, surrounding text is not always available or relevant to the image, which leaves a large number of images either irretrievable or returned for irrelevant queries.
In order to make all images retrievable and improve the relevance of retrieved images, the machine needs the ability to directly measure image-query similarity by extracting information from the image itself. Though it sounds intuitive, this is a difficult task and far from being solved, for the following two reasons:
- Extracting semantic information from images is hard even with state-of-the-art hand-crafted image features (e.g., super-vector coding [21], Fisher vector [12], spatial pyramid matching [9], etc.).
- The number of possible queries is huge, if not infinite, so it is impractical to build classifiers query by query as in image classification tasks.
Recent significant progress in DNN has shown the possibility and superiority of automatically learning representations from raw inputs such as images and texts. Inspired by the success of DNN in image classification and word embedding tasks, we propose a unified DNN to model image-query similarity. The proposed DNN unifies a Convolutional Neural Network (CNN) and a Word Embedding Network (WEN) to generate representations from images and queries respectively; the final outputs of the CNN and the WEN reside in the same vector space, and their inner product is defined as the image-query similarity. CNN has shown its superiority over hand-crafted image features in extracting semantic information from images via automatically learned features [8, 19]. WEN has been successfully used in natural language processing tasks by learning low-dimensional vector representations of words [3], and the query representation is modeled as a linear weighted combination of word vectors. With the unified DNN, both image and query are mapped into the same feature vector space, as illustrated in Figure 1.
A DNN requires a large amount of training data to learn its parameters. Here, we utilize a large-scale clickthrough dataset collected from Bing image search as the training dataset, which contains 23 million clicked image-query pairs from 11.7 million queries and 1 million images [7]. The large number of queries, images and their associations provides good coverage of the sample space. With such a large number of training examples, there is no observable overfitting problem even without using dropout [8].
Qualitative results show that our learned CSM model constructs a meaningful common vector space for both image and query. We further evaluate the learned DNN on an image retrieval task with 1000 queries. The quantitative comparison with several competing methods on image retrieval demonstrates the effectiveness of the proposed method.
The rest of the paper is organized as follows. Related work is presented in Section II, and the unified DNN for joint image-query modeling and learning is introduced in Section III. The experimental setting is described in Section IV, and experimental results on a large-scale clickthrough dataset are presented in Section V. Finally, we conclude this work in Section VI.
II RELATED WORK
As one important domain of information retrieval, image retrieval has been intensively studied for decades in both the academic and the industrial communities [20]. However, current image retrieval systems still mainly rely on the surrounding text of images to perform the retrieval task. Because surrounding text is often missing or noisy, many works have proposed using the image content to measure image-query similarity. With continuously developing image content understanding techniques, especially the rebirth of the convolutional neural network, image content is gradually playing a more important role.
II-A Image annotation as intermediate step
Automatic image annotation is the process by which a machine automatically assigns keywords to an image, and image retrieval is then performed over the annotated keywords. A typical image annotation pipeline first represents images with visual features and then predicts the keywords of images with machine learning algorithms. According to the algorithms used, image annotation approaches can be roughly divided into two categories, i.e., model-based approaches [15, 14] and data-driven approaches [16].
In model-based approaches, image annotation is treated as a multi-class or multi-label classification problem, where a manually labeled dataset is used for learning models such as SVM and boosting [15, 14]. Model-based approaches often work with thousands of categories and are impractical to scale up to millions of queries or more.
Compared with model-based approaches, data-driven approaches are not limited by the number of queries. In a data-driven approach, the annotation of an image is assigned by propagating the annotations of its similar images [16]. Due to the limitations of low-level image features, data-driven approaches only work well for images with enough duplicates in the training set. It is worth mentioning that image annotation serves in image retrieval as an intermediate step, and queries still need to be compared with the annotations to accomplish the retrieval.
II-B Joint image and query modeling
To avoid the intermediate step of image annotation, many works have studied how to jointly model image and query so that image-query similarity is estimated directly. There are two main directions in this area: one uses generative models and the other uses discriminative models.
Generative models are widely applied in joint image and query modeling because they can easily take different modalities into account. Different kinds of generative models have been proposed for joint image and query modeling, including latent Dirichlet allocation [1], probabilistic latent semantic analysis [11], hierarchical Dirichlet processes [18], machine translation methods [4] and the deep Boltzmann machine [13]. As it is still difficult to learn probabilities on raw images, hand-crafted image features are used in the modeling.
Discriminative models generally perform better. In discriminative models, joint kernels over image and query are defined and learned for ranking images [6, 17]. Although different image features and diverse kernel functions are considered in these works, their modeling ability is still limited by the visual features and their shallow structures. In [5], a deep visual-semantic embedding model is proposed to measure image-query similarity with an automatically learned convolutional neural network. Unlike our method, it still requires ImageNet for supervised pretraining.
III CROSS SPACE MAPPING MODEL
In this section, we describe the unified DNN that models image-query similarity and accomplishes the cross space mapping. We first introduce the CNN and the WEN separately for image and query modeling, and then combine these two networks into one unified DNN to define the image-query similarity. Finally, the training procedure for learning the DNN model parameters is introduced.
III-A CNN for Image Modeling
Images are stored as raw pixels in the machine, so we use a standard CNN [8, 10] without softmax outputs for image modeling. The CNN contains seven layers with weights: five convolutional layers and two fully-connected layers. Three max-pooling layers follow the first, second and fifth convolutional layers, and two local contrast normalization layers follow the first and second max-pooling layers. More details of these operations can be found in [8]. The lower part of Figure 2 illustrates the architecture of the image part. Via the CNN, an image $I$ is mapped to a $d$-dimensional vector space and denoted as $\phi(I)$.
III-B WEN for Query Modeling
Queries are stored as sets of words in the machine, and word embedding [3] is leveraged for query modeling. To this end, we build a vocabulary $V$ formed by the 50K words with the highest word frequency in the training set, where $|V|$ = 50K. With word embedding, a word $w$ is mapped into the $d$-dimensional space as $v_w$ using a lookup table $W$, and $W$ is learned in the training procedure. Then a query $q$ is mapped to the same space as $\psi(q)$ by a weighted linear combination of its words' vectors, i.e.,
$$\psi(q) = \sum_{w \in q} \alpha_w v_w$$
where $\alpha_w$ is the weight for word $w$, normally defined as the normalized idf weighting
$$\alpha_w = \frac{\mathrm{idf}_w}{Z_q}$$
where $\mathrm{idf}_w = -\log(r_w)$, $r_w$ refers to the fraction of corpus queries containing the word $w$, and $Z_q$ normalizes the weights over the words of $q$. The upper part of Figure 2 illustrates the architecture of the query part, which is a network with two layers: the first layer takes the bag-of-words representation of the query as input, and the second layer outputs the query embedding vector. The parameters for word embedding are represented as the weights between the two fully-connected layers.
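The query mapping described above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the toy vocabulary and counts, and the choice to normalize the idf weights so that they sum to one are all assumptions (the text only states that the weights are normalized idf). The sketch also assumes every word with a frequency count has an entry in the lookup table.

```python
import math

def idf_weights(query_words, query_freq, num_queries):
    """Normalized idf weights for the in-vocabulary words of one query.

    query_freq[w] is the number of corpus queries containing w (hypothetical input).
    Sum-to-one normalization is an assumption made for this sketch.
    """
    idf = {w: -math.log(query_freq[w] / num_queries)
           for w in query_words if w in query_freq}
    total = sum(idf.values())
    if total == 0.0:
        return {}
    return {w: value / total for w, value in idf.items()}

def embed_query(query_words, word_vectors, query_freq, num_queries):
    """Map a query to the common space as a weighted sum of its word vectors."""
    dim = len(next(iter(word_vectors.values())))
    query_vec = [0.0] * dim
    weights = idf_weights(query_words, query_freq, num_queries)
    for w, alpha in weights.items():
        for k, x in enumerate(word_vectors[w]):
            query_vec[k] += alpha * x
    return query_vec

# Toy 2-dimensional lookup table; in the model these vectors are learned.
word_vectors = {"fat": [1.0, 0.0], "cat": [0.0, 1.0]}
query_freq = {"fat": 10, "cat": 100}
print(embed_query(["fat", "cat"], word_vectors, query_freq, num_queries=1000))
```

The rarer word ("fat") receives the larger weight, so it dominates the query vector; out-of-vocabulary words contribute nothing.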
III-C Image-Query Similarity
With the image mapping $\phi(I)$ and the query mapping $\psi(q)$, images and queries are both mapped into a common feature space, and the image-query similarity can be defined as their inner product, i.e.,
$$s(I, q) = \langle \phi(I), \psi(q) \rangle$$
where $s(I, q)$ is the image-query similarity.
As the output of the unified DNN model, $s(I, q)$ can be used to determine whether image $I$ and query $q$ are relevant or not, and can naturally be used to rank candidate images for a specific query.
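As a small illustration of inner-product similarity and the ranking it induces, the following sketch scores already-mapped vectors. The vectors, image identifiers and function names are hypothetical; in the model the vectors would come from the CNN and the WEN.

```python
def similarity(image_vec, query_vec):
    """Image-query similarity as the inner product of the two mapped vectors."""
    return sum(a * b for a, b in zip(image_vec, query_vec))

def rank_images(query_vec, image_vecs):
    """Return image ids sorted by decreasing similarity to the query."""
    return sorted(image_vecs,
                  key=lambda image_id: similarity(image_vecs[image_id], query_vec),
                  reverse=True)

# Hypothetical 2-dimensional embeddings for three images and one query.
image_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5]}
query_vec = [0.2, 0.8]
print(rank_images(query_vec, image_vecs))
```

Because the score is a plain inner product, image vectors can be computed once offline and only the query vector has to be computed at search time.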
III-D Training Data Preparing
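Given a clickthrough dataset denoted by $D = \{\mathcal{I}, \mathcal{Q}, M\}$, $\mathcal{I}$ and $\mathcal{Q}$ refer to the image set and the query set respectively, and $M$ is the click matrix, which represents the clicks between images and queries in the training set. With the image-query similarity $s(I, q)$, we further define the following constraint, which requires clicked image-query pairs to have larger similarity than unclicked ones:
$$s(I^+, q) > s(I^-, q), \quad \forall I^+ \in \mathcal{I}_q^+, \; \forall I^- \in \mathcal{I}_q^-$$
where $q \in \mathcal{Q}$, $\mathcal{I}_q^+$ is the set of clicked images of query $q$, and $\mathcal{I}_q^-$ is the set of unclicked images of query $q$.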
In a web-scale image set, the unclicked image set of each query is often too large for direct optimization. Thus, the practical negative set is a subset sampled from the complement of the query's clicked image set. Here, we propose a preprocessing stage that attempts to sample better negative examples. The click matrix is only partially observed, that is, non-clicked image-query pairs are not necessarily irrelevant. As illustrated in Figure 3, the bottom image is marked by M as irrelevant to the query dog although it is actually relevant. Yet the top and bottom images share other queries, such as neapolitan mastiff, which means the bottom image should be removed from the negative set of the query dog.
Based on this idea, we define a first-order image relationship matrix from the click matrix, together with higher-order image relationship matrices derived from it. In this paper, we utilize such a relationship matrix to remove the potentially relevant images from the negative set of a specific query, and the final negative set is sampled from the images that remain.
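The negative-set preprocessing can be illustrated with a small sketch. It is only one possible reading of the idea: it uses a first-order relationship (two images are related when they share at least one clicked query), and the data layout, function names and fixed random seed are assumptions introduced here, not details from the paper.

```python
import random

def filtered_negatives(query, clicks):
    """Candidate negatives for `query` after removing potentially relevant images.

    clicks maps each query to the set of image ids clicked for it (hypothetical layout).
    An image is dropped if it shares any clicked query with a clicked image of `query`.
    """
    positives = clicks.get(query, set())
    related = set()
    for images in clicks.values():
        if images & positives:      # this query was also clicked for a positive image
            related |= images       # so its other clicked images may be relevant too
    all_images = set().union(*clicks.values())
    return all_images - related - positives

def sample_negatives(query, clicks, k, seed=0):
    """Sample up to k negatives from the filtered candidate pool."""
    pool = sorted(filtered_negatives(query, clicks))
    return random.Random(seed).sample(pool, min(k, len(pool)))

# Toy log mirroring the dog / neapolitan mastiff example.
clicks = {
    "dog": {"top"},
    "neapolitan mastiff": {"top", "bottom"},
    "chair": {"chair_img"},
}
print(filtered_negatives("dog", clicks))
```

In the toy log, the image "bottom" is never clicked for dog, but it is excluded from the negatives of dog because it shares the query neapolitan mastiff with a clicked image.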
III-E Training Objective
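In order to measure the discrimination between the clicked image set $\mathcal{I}_q^+$ and the negative image set $\mathcal{I}_q^-$ of a query $q$, we define the inter-class scatter of query $q$ as:
$$\Delta_q(\theta_\phi, \theta_\psi) = \min_{I^+ \in \mathcal{I}_q^+,\; I^- \in \mathcal{I}_q^-} \left[ s(I^+, q) - s(I^-, q) \right]$$
where $s(I, q)$ is the image-query similarity, and $\theta_\phi$ and $\theta_\psi$ are the parameters of the image mapping and the query mapping respectively. As the minimum score difference over all positive-negative image pairs for query $q$, $\Delta_q$ can be regarded as the margin in classification tasks, where a larger margin yields better discrimination.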
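The margin of a single query can be computed directly from its definition as the minimum score difference over all positive-negative pairs. The sketch below uses hypothetical, already-mapped vectors and relies on the fact that this minimum equals the lowest positive score minus the highest negative score.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def query_margin(query_vec, positive_vecs, negative_vecs):
    """Minimum of s(I+, q) - s(I-, q) over all positive-negative image pairs."""
    worst_positive = min(dot(p, query_vec) for p in positive_vecs)
    best_negative = max(dot(n, query_vec) for n in negative_vecs)
    return worst_positive - best_negative

def query_loss(query_vec, positive_vecs, negative_vecs):
    """Per-query loss: the negative of the margin."""
    return -query_margin(query_vec, positive_vecs, negative_vecs)

# Hypothetical 2-dimensional embeddings.
q = [1.0, 0.0]
positives = [[0.9, 0.0], [0.7, 0.1]]
negatives = [[0.2, 0.5], [0.4, 0.0]]
print(query_margin(q, positives, negatives))
```

A positive margin means every clicked image scores above every negative image for that query; a negative margin means at least one pair is ranked in the wrong order.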
Given the data, the parameter vectors of the image mapping and the query mapping jointly determine the margin. Figure 4 shows the different margins obtained with different query mappings under a fixed image mapping. Figure 4 can also be regarded as the case where the visual features are preselected in image retrieval: the goal is then to find the optimal query weight vector that maximizes the margin, while the visual feature, such as SIFT or GIST, is fixed.
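However, a preselected visual feature mapping may not be able to distinguish the negative set from the positive set, so the inter-class scatter is also influenced by the image mapping. Our CSM therefore aims at learning the parameter vectors $\theta_\phi$ of the image mapping and $\theta_\psi$ of the query mapping simultaneously by enlarging the margins over all queries. The training objective of CSM can be formulated as follows:
$$\max_{\theta_\phi, \theta_\psi} \sum_{q \in \mathcal{Q}} \Delta_q(\theta_\phi, \theta_\psi)$$
where $\Delta_q$ is the margin of query $q$ and $\mathcal{Q}$ is the query set. To avoid a trivial solution, the norms of both the image mapping $\phi(I)$ and the query mapping $\psi(q)$ are constrained to be less than 1.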
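The DNN is trained by stochastic gradient descent with a batch size of 128 queries; for each query, the loss is defined as the negative of its margin. In the training process, the update rule for the network parameters $\theta$ is formulated as:
$$v_{t+1} = \mu v_t - \lambda \epsilon \theta_t - \epsilon \left\langle \frac{\partial L}{\partial \theta} \Big|_{\theta_t} \right\rangle_{B_t}, \qquad \theta_{t+1} = \theta_t + v_{t+1}$$
where $t$ is the iteration index, $\mu$ is the momentum, set to 0.9, $v$ is the momentum variable, $\lambda$ is the weight decay, $\epsilon$ is the learning rate, and $\langle \partial L / \partial \theta |_{\theta_t} \rangle_{B_t}$ is the average over the batch $B_t$ of the derivative of the objective $L$ with respect to $\theta$, evaluated at $\theta_t$. The learning rate is initially set to 0.01 and is decreased by a factor of 10 when the margin on a validation set stops improving.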
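The update described above is stochastic gradient descent with momentum and weight decay. A minimal sketch on a flat parameter list follows; it assumes the same form of update as in [8], the function names are hypothetical, and because the weight decay value is not given in this copy of the text, `weight_decay` defaults to 0.0 purely as a placeholder.

```python
def momentum_step(theta, velocity, grad, lr=0.01, momentum=0.9, weight_decay=0.0):
    """One SGD step with momentum and weight decay.

    grad is the batch-averaged derivative of the objective at theta.
    weight_decay=0.0 is a placeholder, not the value used in the paper.
    """
    new_velocity = [momentum * v - weight_decay * lr * w - lr * g
                    for w, v, g in zip(theta, velocity, grad)]
    new_theta = [w + v for w, v in zip(theta, new_velocity)]
    return new_theta, new_velocity

def decay_on_plateau(lr, val_margin, best_val_margin, factor=10.0):
    """Divide the learning rate by `factor` when the validation margin stops improving."""
    return lr / factor if val_margin <= best_val_margin else lr

theta, velocity = [1.0], [0.0]
for _ in range(2):
    theta, velocity = momentum_step(theta, velocity, grad=[2.0])
print(theta, velocity)
```

The velocity accumulates a decaying sum of past gradients, so the second step moves further than the first even though the gradient is unchanged.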
III-F Image Retrieval System
Based on the learned CSM, we can build a textual-query-based image retrieval system, as shown in Figure 5. The candidate images in the database are translated into visual feature vectors in the mapped space, and the textual query entered by the user is translated into a weight vector in the same common space. By calculating and sorting the scores of the input query paired with each candidate image, the system outputs a ranked image list as the retrieval result. A few input queries cannot be matched with the training word set, because none of their words appear among the selected high-frequency training words. In this case, the retrieval system returns a random ranking.
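The retrieval flow, including the random fallback for queries with no in-vocabulary word, could look roughly like the sketch below. All names and vectors are hypothetical, and the query vector is a plain average of the known word vectors, which is a simplification of the weighted combination used in the model.

```python
import random

def retrieve(query, word_vectors, image_vecs, rng=None):
    """Rank image ids for a textual query; fall back to a random order if no word is known."""
    rng = rng or random.Random()
    image_ids = list(image_vecs)
    known = [w for w in query.lower().split() if w in word_vectors]
    if not known:
        rng.shuffle(image_ids)          # no usable word: random ranking
        return image_ids
    dim = len(word_vectors[known[0]])
    # Simplification: uniform average of the known word vectors.
    query_vec = [sum(word_vectors[w][k] for w in known) / len(known) for k in range(dim)]

    def score(image_id):
        return sum(a * b for a, b in zip(image_vecs[image_id], query_vec))

    return sorted(image_ids, key=score, reverse=True)

word_vectors = {"chair": [1.0, 0.0], "church": [0.0, 1.0]}
image_vecs = {"img_chair": [0.9, 0.1], "img_church": [0.1, 0.9]}
print(retrieve("chair", word_vectors, image_vecs))
print(retrieve("caravansary", word_vectors, image_vecs, rng=random.Random(0)))
```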
IV EXPERIMENT SETTING
IV-A Dataset Description
The user-click data was collected from the Bing search engine and is publicly accessible as the MSR-Bing Image Retrieval Challenge [7]. In this dataset, images are collected from the Web, and the labels are the textual queries entered by Bing's users. The dataset is based on queries received at Bing Image Search in the EN-US market. It comprises two parts: the training dataset, and the dev dataset, whose labels are judged by annotators and which is used as the test dataset.
The training dataset includes 11,701,890 queries, 1,000,000 images and 23,094,592 clicked (query, image) pairs, where the clicked data is randomly sampled from one year of the Bing Image Search log. The topics of the queries are wide and diverse; some examples are shown in Figure 6.
The test dataset comprises 1,000 queries and 79,665 images, which are also randomly sampled from one year of the Bing Image Search log in the EN-US market. In order to measure relevance, a large set of plausible retrieval results is judged manually for each query. The relevance of an image with respect to a query is measured on three levels: Excellent = 3, Good = 2, Bad = 0. Judgment guidelines and procedures are established to ensure high data quality and consistency.
IV-B Evaluation Criterion
In order to measure the quality of the search results, we adopt the Discounted Cumulated Gain (DCG) to quantify retrieval performance. The standard DCG is defined as:
$$DCG_p = \alpha_p \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$
where $p$ is the number of images in the ranked list and $rel_i$ is the relevance level of the result at position $i$. In our experiment, $rel_i \in \{3, 2, 0\}$ as previously mentioned and $p = 25$; $\alpha_p$ is the normalizer that makes the best $DCG_p$ equal to 1.
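A short sketch of the normalized DCG computation follows, assuming the common gain $2^{rel} - 1$ with a $\log_2(i + 1)$ discount and a list length of 25. The function names are hypothetical; the normalizer is computed from the stated condition that a list of 25 Excellent results scores exactly 1.

```python
import math

def dcg_normalizer(p, best_rel=3):
    """Constant that makes a list of p results at the top relevance level score 1."""
    best = sum((2 ** best_rel - 1) / math.log2(i + 1) for i in range(1, p + 1))
    return 1.0 / best

def dcg_at_p(relevances, p=25):
    """Normalized DCG of a ranked list of relevance levels (3, 2 or 0)."""
    gains = sum((2 ** rel - 1) / math.log2(i + 1)
                for i, rel in enumerate(relevances[:p], start=1))
    return dcg_normalizer(p) * gains

print(round(dcg_normalizer(25), 5))
print(dcg_at_p([3, 3, 2, 0, 2]))
```

With this normalization, a query that has fewer than 25 Excellent candidates cannot reach a score of 1 even under an ideal ordering.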
V EXPERIMENT RESULTS
In this section, we evaluate CSM-based image retrieval through both qualitative and quantitative results.
V-A The Learned Mapping Space
First, we qualitatively demonstrate the effectiveness of CSM by visualizing the learned mapping space. Figure 7 visualizes six randomly selected dimensions of the learned feature space by showing the images with high responses on each dimension. The pattern captured by each dimension is both visually and semantically meaningful. Figure 8 demonstrates the effectiveness of the inner product in the learned common space by showing the nearest-neighbor words and images of some exemplar words, measured by inner products. Although the nearest words contain some spelling mistakes, their intended meaning is easy to guess.
V-B Overall Performance
In order to validate the overall performance of CSM, we compare it with two state-of-the-art single models on the dataset, i.e., the Concept Classification model [2] and the Passive-Aggressive model [6]. The Concept Classification model builds a binary classifier for each concept using a standard SVM. The Passive-Aggressive model uses a parametric function to map images into the text space and optimizes a learning criterion related to ranking performance. Both models adopt HOG features as the image representation. In addition, we compare our result with the ideal ranker and the random ranker. The ideal ranking is the optimal ranking list generated from the relevance labels given by the annotators, and the random result is a random ordering of the candidate images. As mentioned in Section IV-B, $DCG_{25}$ is used to quantify the performance of the ranking list. The overall performance is shown in Table I.
Because of the nature of the test dataset, the average $DCG_{25}$ of the ideal ranking is less than 1, since some test queries have fewer than 25 excellent candidate images. CSM achieves much better results than the state-of-the-art models that use sophisticated hand-crafted image features, which quantitatively demonstrates the effectiveness of CSM in measuring image-query similarity for image retrieval.
V-C Detailed Results Analysis
Among the 1,000 test queries, 71 queries achieve ideal retrieval performance, and for another 235 queries the DCG25 is within 0.05 of the ideal ranking.
Figure 9 shows six retrieval results produced by CSM, including four queries that achieve a DCG25 above 0.9 and two failure cases with a DCG25 close to 0. The queries chair, fat cat and church can be matched exactly in the training query set. Although the query beer stein from Germany cannot be matched exactly, the training query beer stein helps the WEN map it to an effective textual weight vector. The first failure case, vanese mcneill, is caused by the ideal DCG25 being nearly zero, since there are few relevant images for this query. For the last failure case, american caravansary of the 1920, the key word caravansary is missing from the training word set, while the other words do not help to produce an effective textual weight vector.
In addition, we discuss how retrieval performance varies with query length. On the one hand, more words refine the search intention and limit the number of available candidate images, as demonstrated by Figure 10, where longer queries have lower DCG25.
On the other hand, different query lengths come with different query matching types. Statistically, 392 test queries have exactly matching queries in the training query set, while 19 test queries have no match in the training set. The remaining 589 test queries can be partly matched, which means that these queries contain one or more words from the training query set. Different matching types lead to different ranking performance: no match leads to a random ranking, as previously mentioned, while an exact match can produce better ranking results. A partial match is likely to introduce semantic ambiguity, since queries with a partial match are usually matched with several training queries. As shown in Figure 11, the set of longer queries includes a higher proportion of test queries with a partial match.
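The three matching types can be made concrete with a small classifier. This is an illustrative sketch only: the function name and the toy training queries are hypothetical, and matching is done on whitespace-separated words, which may differ from the tokenization used in the experiments.

```python
def match_type(test_query, train_queries, train_words):
    """Classify a test query as an 'exact', 'partial' or 'none' match."""
    if test_query in train_queries:
        return "exact"
    if any(word in train_words for word in test_query.split()):
        return "partial"
    return "none"

# Toy training query set; the real one contains 11.7 million queries.
train_queries = {"beer stein", "chair", "fat cat"}
train_words = {word for query in train_queries for word in query.split()}

for query in ["chair", "beer stein from germany", "vanese mcneill"]:
    print(query, "->", match_type(query, train_queries, train_words))
```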
VI Conclusions
In this paper, we proposed a novel approach to image retrieval that reformulates the problem as mapping images and textual queries into one common space with a unified deep neural network. With sufficient training images provided by user clicks, the trained DNN significantly improves image retrieval performance compared with state-of-the-art methods based on predefined image features. In addition, the CSM model measures not only the similarity between query and image, but also the similarity between textual queries and the similarity between images. As the query embedding part is still affected by the out-of-vocabulary problem, in future work we will incorporate word embeddings from natural language processing tasks to enhance the query embedding.
References
- [1] K. Barnard, P. Duygulu, D. Forsyth, N. De Freitas, D. M. Blei, and M. I. Jordan, “Matching words and pictures,” The Journal of Machine Learning Research, vol. 3, pp. 1107–1135, 2003.
- [2] G. Carneiro and N. Vasconcelos, “Formulating semantic image annotation as a supervised learning problem,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2. IEEE, 2005, pp. 163–168.
- [3] R. Collobert and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in ICML, 2008.
- [4] P. Duygulu, K. Barnard, J. F. de Freitas, and D. A. Forsyth, “Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary,” in ECCV. Springer, 2006, pp. 97–112.
- [5] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov, “DeViSE: A Deep Visual-Semantic Embedding Model,” in NIPS, 2013.
- [6] D. Grangier and S. Bengio, “A discriminative kernel-based approach to rank images from text queries,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 30, no. 8, pp. 1371–1384, 2008.
- [7] X.-S. Hua, L. Yang, J. Wang, J. Wang, M. Ye, K. Wang, Y. Rui, and J. Li, “Clickage: Towards bridging semantic and intent gaps via mining click logs of search engines,” in Proceedings of the 21st ACM international conference on Multimedia. ACM, 2013, pp. 243–252.
- [8] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, 2012, pp. 1106–1114.
