An End-to-End Solution for Effectively Demoting Watermarked Images in Image Search
Ning Ma, Xin Zhao, Mark Bolin

TL;DR
This paper presents an end-to-end approach combining watermark feature extraction and a hybrid ranking metric to effectively demote watermarked images in search results, improving image search quality.
Contribution
It introduces a novel hybrid metric incorporating watermark signals and demonstrates its effectiveness in demoting watermarked images in search rankings.
Findings
Deep CNNs achieve high accuracy in watermark detection.
Domain-based watermark classification enhances detection.
The hybrid metric significantly reduces watermarked images in search results.
Table 1: Number of labeled images in the training, validation, and test sets.

| Label | Training | Validation | Testing |
|---|---|---|---|
| 1:Watermarked | 587K | 32K | 33K |
| 0:No Watermark | 646K | 36K | 36K |
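
Table 2: Test accuracy after transfer learning (only the top and auxiliary classification layers retrained).

| Model | Test Accuracy |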
|---|---|
| Resnet50 | 69.92% |
| Inception-V3 | 64.12% |
| Densenet161 | 68.93% |
| Resnet152 | 70.63% |
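
Table 3: Test accuracy after end-to-end training, without and with the domain-based watermark signal.

| Model | Test Accuracy |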
|---|---|
| Resnet50 | 84.45% |
| Inception-V3 | 85.70% |
| Densenet161 | 83.96% |
| Resnet152 | 83.86% |
| Resnet50 + Domain | 87.04% |
| Inception-V3 + Domain | 87.84% |
| Densenet161 + Domain | 86.61% |
| Resnet152 + Domain | 86.49% |

Table 4: Examples of domains with a high percentage of watermarked images.

| Watermark Domain List | |
|---|---|
| 1. clipartartists.com | 2. www.gettyimages.com |
| 3. www.alamy.com | 4. www.shutterstock.com |
| 5. www.dreamstime.com | 6. www.cosplayfancy.com |
| 7. www.teamclipart.com | 8. www.colourbox.de |
| 9. www.recipestable.com | 10. www.sheepskintown.com |

Table 5: Watermark rate and NDCG of the image rankers.

| Ranker | Watermark Rate | NDCG |
|---|---|---|
| No watermark signal | 5.2% | 62.3 |
| Domain watermark signal | 4.7% | 62.4 |
| DNN watermark signal | 3.9% | 62.5 |
| Both watermark signals | 3.8% | 62.5 |
An End-to-End Solution for Effectively Demoting Watermarked Images in Image Search
Ning Ma, Xin Zhao, Mark Bolin
Microsoft
{ninm, xinzhao, markbo}@microsoft.com Currently a software engineer at Pinterest
Abstract
We propose an end-to-end solution, from watermark feature generation to metric design, for effectively demoting watermarked images surfaced by a real-world image search engine. We use a few fundamental techniques to obtain effective watermark features for images in the image search index, and utilize these signals in a commercial search engine to improve image search quality. We collect a large and diverse set (about 1M) of images with human labels indicating whether each image contains a visible watermark. We train several deep convolutional neural networks to extract watermark information from the raw images. The deep CNN classifiers we trained achieve high accuracy on the watermark test data set. We also analyze the images based on their domains to obtain watermark information from a domain-based watermark classifier. We design a novel hybrid metric that combines relevance, image attractiveness and watermark information. We demonstrate that using these watermark signals together with the new metric in the image search ranker significantly demotes watermarked images during online image ranking.
1 Introduction
Watermarking is a widely used technique to protect the copyright of photographs. A huge number of watermarked images exist online. For example, several well-known stock image websites use watermarks to protect their high-quality images from being copied by third parties. The drawback is that images with visible watermarks are often shown when customers search for images on a search engine such as Bing, Google or Yahoo. Watermarked images can be annoying and degrade the customer experience. Some researchers have looked into watermark removal techniques [4, 3, 12, 14] that remove the watermark from images or video. Most algorithms only work well in special situations, such as in [4], where the watermark has a consistent pattern without large variation. However, watermark removal techniques are not very useful in image search, for two main reasons. First, the search engine should not remove the watermark before returning search results to users: doing so would remove the copyright protection of the original images and cause legal issues. Second, these techniques do not work well when the watermarks vary widely, which is exactly the case in a real image search index. Previous work [9, 11] demonstrates that a universal image attractiveness model can indicate the impact of a watermark by producing a lower attractiveness score for a watermarked image than for the original image. However, when the DARN model [9] predicts the score, it takes all possible image attributes into consideration, so the impact of the watermark is insignificant, as shown in Figure 1. The score decreases significantly only when there are massive watermarks on the original image. The NIMA [11] model even gives the highest score to the image with the most watermarks, likely because the AVA database does not contain watermarked images. So, purely using a model-based image attractiveness score to indicate watermarks is not suitable either.
For image retrieval, a more appropriate approach is to demote images whose quality is significantly impaired by watermarks. In this paper, we propose a few fundamental techniques to obtain effective watermark signals for images coming from a real image search index, and utilize those signals in a commercial search engine to improve image search quality. Benefiting from the fast advance of deep learning, deep convolutional neural networks (CNNs) have been widely used in image classification and detection tasks, and have achieved performance comparable to humans. In section 2.1, we train several deep CNN models to predict the probability that an image contains a watermark, using ResNet [7], DenseNet [8], and Inception-V3 [10] as the backbone. The models are trained end to end on a large image set with a variety of watermarks collected from real online images. The details of the data set are described in section 2.1. We show that the prediction accuracy of the deep CNN models is very promising on a data set with such diverse watermark patterns. This indicates the potential of building a DNN-based universal watermark classifier. In section 2.2, we obtain an additional watermark signal by analyzing the domains the images come from. Our analysis indicates that the domain is a very strong indicator of watermarks. This makes sense, as many watermarked images come from stock image websites. However, for these watermark features to take effect in the image ranker, the ranker must use a metric that reflects watermark information. In image retrieval, the metric used is the normalized discounted cumulative gain (NDCG), computed by $N_i = n_i \sum_{j=1}^{T} (2^{r(j)} - 1) / \log(1 + j)$, where $r(j)$ is the integer relevance label of the $j$-th URL in the sorted list and $n_i$ is the normalization factor. Since the rating score only considers relevance, the image ranker will not pick up the watermark information even if watermark features are available.
In section 3, we introduce a novel hybrid metric that combines relevance, image attractiveness and watermark information. We learn the weights of these factors from side-by-side labeled data. In section 4, we demonstrate the effectiveness of demoting watermarked images in an image search engine by using these watermark signals and the new metric in the image ranker.
2 Watermark Signal
In this section, we demonstrate how we obtain watermark signals from two different approaches. The first approach is to get a watermark signal from the raw image content. This is a more biologically plausible method as humans only need to look at the raw image to tell if it contains a watermark. The second approach is from its corresponding domain information.
2.1 Image content based watermark signal
A human can tell whether an image contains a watermark by directly looking at the image. Ideally, we should be able to train a similar classifier reflecting the probability that the image contains a watermark. The probability should reflect the visibility of the watermark in the images. Less visible watermarks should get lower probability.
Data collection: We scraped a large number of images from web image search results. For each image, we had 1-5 judges rate whether it contained a visible watermark. If any judge thought the image contained a visible watermark, the image was labeled as positive, otherwise negative. Since there are more non-watermarked images than watermarked ones, we randomly sampled from the non-watermarked images so that the two classes are balanced. Next, we split the data into training, validation and test sets with the ratio 90%:5%:5%. We also removed images that were broken or could not be downloaded. Table 1 shows the numbers of images we used to train and test the model. Figure 2 shows a few examples of the watermarked images.
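The labeling rule and the split described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, the `(image_id, label)` tuple layout, and the fixed random seed are assumptions made for the example.

```python
import random

def label_image(judge_votes):
    """Return 1 (watermarked) if any judge saw a visible watermark, else 0."""
    return 1 if any(judge_votes) else 0

def balance_and_split(labeled, seed=0, percents=(90, 5, 5)):
    """Downsample negatives to match positives, then split train/val/test.

    labeled: list of (image_id, label) tuples (hypothetical layout).
    """
    rng = random.Random(seed)
    positives = [item for item in labeled if item[1] == 1]
    negatives = [item for item in labeled if item[1] == 0]
    # The text states that non-watermarked images are the majority class.
    negatives = rng.sample(negatives, min(len(negatives), len(positives)))
    data = positives + negatives
    rng.shuffle(data)
    n_train = len(data) * percents[0] // 100
    n_val = len(data) * percents[1] // 100
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]
    return train, val, test
```

Integer percentages are used for the split sizes so that the counts do not depend on floating-point rounding.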
Data augmentation: We have about one million images, half of which have watermarks. During training, we applied the following data augmentations to improve performance. We scaled each image to be slightly larger than the model input dimension, and then used center cropping to obtain an image that satisfies the input requirement of each deep CNN model. We also used horizontal and vertical flips to enlarge the training set without losing the original watermark.
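As a rough illustration of these augmentation steps, the sketch below operates on an image stored as a nested list of pixel values. It is a library-free toy, assuming nearest-neighbor scaling (the paper does not state the interpolation method); the function names are hypothetical.

```python
def resize_nearest(image, new_height, new_width):
    """Nearest-neighbor resize of a 2-D list of pixels (assumed interpolation)."""
    height, width = len(image), len(image[0])
    return [
        [image[r * height // new_height][c * width // new_width]
         for c in range(new_width)]
        for r in range(new_height)
    ]

def center_crop(image, size):
    """Crop a size x size window from the center of the image."""
    height, width = len(image), len(image[0])
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]

def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def vertical_flip(image):
    """Mirror the image top to bottom."""
    return image[::-1]
```

Center cropping after a slight upscale keeps most of the frame, which matters here because watermarks often sit near the image border or corners.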
Model: We explored several deep convolutional neural network structures: ResNet50, ResNet152 [7], DenseNet161 [8], and Inception-V3 [10]. We replaced the final output classification layer with a binary classification layer. In the Inception-V3 model, we also replaced the intermediate auxiliary classification layer with a binary classification layer. The final loss function combines two terms: the cross-entropy loss of the final output layer and the loss of the auxiliary classification layer.
Training: First, we use transfer learning: we freeze the models pretrained on ImageNet [5] and retrain only the top and the auxiliary classification layers. The training and validation errors stop decreasing within ten epochs. Table 2 shows the accuracy of the models on the test data set after 10 epochs of training. ResNet152 obtains the best test accuracy, 70.63%. However, the overall accuracy of transfer learning is low. Next, we train the whole network end to end. Figure 3 shows the training and validation accuracy over epochs. We choose the model that performs best on the validation set and evaluate it on the test set. Inception-V3 has the best test accuracy, 85.70%. Both the validation and training accuracy improve significantly after training the network end to end. This is likely because the high-level DNN features needed for watermark detection differ from those needed for general image classification.
During training, we halve the learning rate every 5 epochs, starting from a fixed initial value. Unlike traditional fine-tuning, where the learning rate is set much smaller, we use the same learning rate and annealing procedure in both transfer learning and end-to-end training. This gives the model more freedom to discover subtle watermark information. Table 3 shows the accuracy of the end-to-end retrained models on the test data set. Only the models with the best performance on the validation set are evaluated on the test data. Our results show that deep CNNs can capture the watermark signal from images well, and that training end to end significantly outperforms training only the final layers.
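The step-wise annealing described above can be written as a one-line schedule. The initial value `0.01` below is purely a placeholder assumption, since the learning rate used in the paper is not recoverable from this text; only the "halve every 5 epochs" rule comes from the source.

```python
def learning_rate(epoch, initial_lr=0.01, halve_every=5):
    """Step decay: halve the learning rate every `halve_every` epochs.

    initial_lr is a placeholder assumption, not the paper's value.
    """
    return initial_lr * 0.5 ** (epoch // halve_every)
```

With this schedule, epochs 0-4 use the initial rate, epochs 5-9 use half of it, and so on.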
Figure 4 shows the prediction results of the trained ResNet50 model. The images in the top row are those detected with a high probability of having a watermark. The bottom row contains images detected with a low probability of having a watermark. The predictions are quite good. It is also interesting that the classifier is not simply detecting text in the images. For example, the third, fourth, fifth and sixth images in the bottom row all contain text, and all of them are correctly recognized as not containing a watermark.
2.2 Domain based watermark signal
For the images in an image search index, the domain an image comes from is also a very strong signal. Many watermarked images in the web index come from stock photo websites. The deep CNN based classifier cannot achieve 100% prediction accuracy on these images. However, a domain-based watermark classifier can achieve higher precision when predicting watermarked images coming from these websites.
In the training data, we group images by the domains on which they are hosted. We compute the percentage of watermarked images in each domain, which is the ratio of the number of watermarked images to the number of all images hosted on that domain. We select domains that contribute more than 5 images and have a watermark rate higher than 90%. In the training data set, about 4.7K out of about 272K domains satisfy this condition. We put these domains in a known watermark domain list. For any image coming from one of those domains, we predict that the image has a visible watermark regardless of the prediction of the deep CNN classifier. Table 4 shows a few domains with a high percentage of watermarked images.
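The domain selection rule above (more than 5 images, watermark rate above 90%) can be sketched as follows. The function name, the `(image_url, label)` input layout, and the use of the URL's network location as the "domain" are assumptions for illustration; the paper does not describe how domains are extracted.

```python
from collections import defaultdict
from urllib.parse import urlparse

def build_watermark_domain_list(samples, min_images=5, min_rate=0.90):
    """Return the set of domains whose images are almost always watermarked.

    samples: iterable of (image_url, label), with label 1 = watermarked.
    """
    totals = defaultdict(int)
    watermarked = defaultdict(int)
    for url, label in samples:
        domain = urlparse(url).netloc  # assumed notion of "domain"
        totals[domain] += 1
        watermarked[domain] += label
    return {
        domain
        for domain, count in totals.items()
        if count > min_images and watermarked[domain] / count > min_rate
    }
```

The minimum-count condition guards against domains that look fully watermarked only because very few of their images were labeled.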
The downside of the domain-based approach is that it requires the domain information of the image source. This is not biologically plausible, as humans need no information besides the raw image to detect a watermark. Also, the domain is dynamic information that can change over time. However, this information is commonly available for images collected from the web. When this domain information is used together with the content-based watermark information, the accuracy improves, as shown in the last four rows of Table 3.
3 The Metric
3.1 LambdaMART Ranking Algorithm
LambdaMART [13] is a widely used algorithm in information retrieval for training rankers, including image rankers. It is built on MART [6]. MART builds regression trees to model the functional gradient of the cost function of interest; in LambdaMART, these are the LambdaRank [1] functional gradients. For more details, we refer to the corresponding literature [6, 2, 1, 13].
In information retrieval, the widely used metric is the normalized discounted cumulative gain (NDCG). During ranker training, each document has a list of features and an associated rating. The LambdaMART model uses the features and rating of each document to optimize the metric and produce a predicted rank score by which the documents are finally ranked.
LambdaRank can be applied to any image relevance (IR) metric. In the original LambdaRank [1] paper, the NDCG is defined as

$$N_i = n_i \sum_{j=1}^{T} \frac{2^{r(j)} - 1}{\log(1 + j)}$$

where $r(j)$ is the integer label for the relevance level of the $j$-th URL in the sorted list and $n_i$ is the normalization factor. However, this rating only considers image relevance. As a result, the ranker cannot pick up the watermark features even when they are available. In our application, instead of using a pure relevance rating, we use a mixed rating score for each image. The rating combines relevance, image attractiveness and watermark information via metric learning. For an image containing a watermark, the attractiveness rating is multiplied by a penalty factor.
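The NDCG definition above can be checked numerically with a short sketch. It assumes that the normalization factor is the reciprocal of the DCG of the ideal (descending) ordering, which is the usual convention but is not spelled out in this text; the function names are illustrative.

```python
import math

def dcg(labels, truncation):
    """Sum of (2^r(j) - 1) / log(1 + j) over the top `truncation` positions."""
    return sum(
        (2 ** r - 1) / math.log(1 + j)
        for j, r in enumerate(labels[:truncation], start=1)
    )

def ndcg(labels, truncation):
    """DCG normalized by the DCG of the ideal ordering (assumed convention)."""
    ideal = dcg(sorted(labels, reverse=True), truncation)
    return dcg(labels, truncation) / ideal if ideal > 0 else 0.0
```

Because the gain `2^r(j) - 1` depends only on the relevance label, two lists with the same relevance labels get the same NDCG no matter how many watermarked images they surface, which is the gap the hybrid rating is meant to close.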
In the following subsections, we will discuss how we get the labels and train the hybrid metric based on these labels.
3.2 Relevance and Watermark label
The relevance and watermark labeling is relatively straightforward. For the relevance label, if the image content perfectly matches the query, the image is labeled as 'Excellent'. If the image content matches the main content of the query, it is labeled as 'Good'. If the image does not cover the main content of the query, it is labeled as 'Bad'. For the watermark label, if the image contains a watermark, it is labeled as '1', otherwise '0'.
3.3 Image attractiveness labeling
For a specific query, we scrape 30 images and compute their image attractiveness scores using the DARN [9] model. We select at most five representative images according to the ranking percentiles of their attractiveness scores: 100%, 75%, 50%, 25% and 0%. We call these five images the reference image set for this query. For each image to be judged, we let the judge compare it against the five reference images, rating each comparison as 'win', 'loss' or 'equal'. We compute the judged attractiveness score from the number of wins, the number of losses, and the total number of judgments. Figure 5 shows how the judged attractiveness score is computed for a 'cat' image.
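The reference image selection described above can be sketched as picking the images at fixed ranking percentiles of the model score. The rounding rule used to turn a percentile into an index is an assumption, as are the function name and the `(image_id, score)` layout; the formula for the judged attractiveness score is not reproduced here.

```python
def select_reference_images(scored_images, percentiles=(100, 75, 50, 25, 0)):
    """Pick at most one image per ranking percentile of attractiveness.

    scored_images: list of (image_id, attractiveness_score) tuples.
    """
    ranked = sorted(scored_images, key=lambda item: item[1])  # ascending score
    last = len(ranked) - 1
    chosen = []
    for p in percentiles:
        index = (p * last + 50) // 100  # round to nearest rank (assumed rule)
        candidate = ranked[index]
        if candidate not in chosen:  # "at most five": skip repeated picks
            chosen.append(candidate)
    return chosen
```

When a query has fewer than five distinct images, several percentiles map to the same image, so the reference set shrinks accordingly.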
3.4 Side by side labeling
The remaining problem is how to design a new hybrid rating that combines relevance, image attractiveness and watermark information. The rating needs to answer a few questions: (1) how to weight the importance of relevance, image attractiveness, and watermark; (2) how the watermark affects the rating; and (3) how the model can have the freedom to learn that relevance is the most important factor, which is usually required in image retrieval.
To learn the hybrid rating, we select 200 queries and scrape 30 images for each query. The data used to train the metric is independent of the data used to train the watermark classifier and the image ranker. Within each query, we randomly pair the images. Each image in a pair has a relevance label, an image attractiveness label, and a watermark label provided by the judges. About 10% of these images contain watermarks. Additionally, we let a judge decide which image is better overall for this particular query by choosing one of five labels: "left better, left slightly better, equal, right slightly better, right better". From this relative pairwise labeling, we can directly learn the metric weights of the three factors. Figure 6 demonstrates this side-by-side labeling procedure. For example, in the first pair, the judge decides that the left image is better even though its image attractiveness is poor. The rating learns that image relevance is more important. In the second pair, the judge decides that the right image is better because it is more appealing. The rating learns that image attractiveness is important. In the third pair, the judge thinks the right image is better because the left image has a watermark and is less attractive. So, the rating learns the importance of watermark and image attractiveness. Eventually, the rating should learn how important each factor is and converge to a hybrid rating reflecting a user's overall experience.
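The random pairing step within a query can be sketched as below. It assumes disjoint pairs, so that each image appears in at most one pair; the paper does not say whether an image may be paired more than once, and the function name and seed are illustrative.

```python
import random

def random_pairs(images, seed=0):
    """Shuffle one query's images and group them into disjoint (left, right) pairs."""
    rng = random.Random(seed)
    shuffled = list(images)
    rng.shuffle(shuffled)
    # With an odd number of images, the last one is left unpaired.
    return [
        (shuffled[k], shuffled[k + 1])
        for k in range(0, len(shuffled) - 1, 2)
    ]
```

Under this assumption, 30 images per query yield 15 side-by-side pairs for the judges.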
The equations below define the new hybrid rating. Each image's rating depends on three labels: the relevance label IR: {0: Bad, 1: Good, 2: Excellent}, the attractiveness label IA, and the watermark label WM: {0: No watermark, 1: Has visible watermark}. RatingIRs is a learnable parameter vector, where RatingIRs[IR] denotes the best rating an image can get for relevance label IR. Here, 'best rating' is the rating of an image that has attractiveness score 1.0 and no watermark. Gamma is also a learnable parameter; it defines the rating buffer between two consecutive relevance labels. It ensures that the rating of an image with a better relevance label exceeds that of an image with a worse relevance label by a certain margin. WMP stands for watermark penalty, the additional penalty applied to the attractiveness score of an image containing a watermark. For a given relevance label IR, the first two equations give the rating for this relevance label, RatingIR, and the rating for the relevance label downgraded by one level, RatingIRPrev. The score is then computed from BucketWidth, the penalized IA, and Gamma. The term BucketWidth * (1 - Gamma) is the buffer region between two relevance labels. Intuitively, it means that if an image gets the relevance label 'Excellent', even with an image attractiveness score of 0 and a watermark, its rating will still exceed that of another image that has only a 'Good' relevance label but a perfect secondary score, by the margin BucketWidth * (1 - Gamma). This gives the model the freedom to learn that relevance is always the dominant factor. The image score is then further scaled by the image attractiveness through the term IA * BucketWidth * Gamma.
$$
\begin{aligned}
\mathrm{RatingIR} &= \mathrm{RatingIRs}[\mathrm{IR}] \\
\mathrm{RatingIRPrev} &= \mathrm{RatingIRs}[\max(\mathrm{IR} - 1, 0)] \\
\mathrm{BucketWidth} &= \mathrm{RatingIR} - \mathrm{RatingIRPrev} \\
\mathrm{IA} &= (1 - \mathrm{WMP} \cdot \mathrm{WM}) \cdot \mathrm{IA} \\
\mathrm{Rating} &= \mathrm{RatingIRPrev} \\
&\quad + \mathrm{BucketWidth} \cdot (1 - \mathrm{Gamma}) \\
&\quad + \mathrm{IA} \cdot \mathrm{BucketWidth} \cdot \mathrm{Gamma} \\
\mathrm{Rating} &= \mathrm{Rating} / \max(\mathrm{RatingIRs})
\end{aligned}
$$
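The rating equations above translate almost line by line into code. The sketch below assumes that the three terms of the rating are summed, which is the reading consistent with the 'best rating' and margin descriptions in the text; the parameter values used in the usage note are invented for illustration and are not the learned values.

```python
def hybrid_rating(ir, ia, wm, rating_irs, gamma, wmp):
    """Hybrid rating from relevance (ir), attractiveness (ia) and watermark (wm).

    rating_irs: best rating per relevance label, indexed 0 (Bad) to 2 (Excellent).
    gamma: share of each bucket that attractiveness can move (learned in the paper).
    wmp: watermark penalty applied to the attractiveness score.
    """
    rating_ir = rating_irs[ir]
    rating_ir_prev = rating_irs[max(ir - 1, 0)]
    bucket_width = rating_ir - rating_ir_prev
    ia = (1 - wmp * wm) * ia  # penalize attractiveness if watermarked
    rating = (
        rating_ir_prev
        + bucket_width * (1 - gamma)   # buffer between relevance levels
        + ia * bucket_width * gamma    # part scaled by attractiveness
    )
    return rating / max(rating_irs)    # normalize so the best rating is 1
```

For example, with the hypothetical values `rating_irs = [1.0, 2.0, 4.0]`, `gamma = 0.5` and `wmp = 0.5`, an 'Excellent' image with zero attractiveness and a watermark still rates above a 'Good' image with perfect attractiveness and no watermark, which is the relevance-first behavior the buffer is designed to guarantee.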
3.5 Metric Loss function
We use a pairwise rank loss similar to the one proposed in [9] to learn the rating. We learn the rating from pairs of images judged side by side with a relative label, as described in section 3.4. The goal is to learn a model in which a higher rating indicates a better image: within a pair, the image with the higher rating should be the one judged better when assessed based on the labels. Figure 8 graphically illustrates the metric learning structure. The model is designed to predict the mean and variance of the rating for an image, as shown in Figure 8A, and the decision boundaries that specify how differences in these distributions correspond to judge preferences, as shown in Figure 8B.
For each image, the model defines a mean rating score, generated by the hybrid rating of section 3.4, and a variance. For each image pair, it defines the posterior probability that the pair receives each side-by-side label.
Assume each image can be rated by a large number of experts who have extremely high confidence in the overall rating. According to the central limit theorem, the rating scores received by each image follow a normal distribution whose mean is the hybrid rating of section 3.4 and whose variance is learned. For the two images in a pair, the score difference is also normally distributed, as shown in the left panel of Figure 8(A). The model learns four boundaries, which are used to map each pair to a label according to its score difference. Let $p_i^j$ denote the probability that the $i$-th pair is labeled as $j$ (indexing {left better, left slightly better, equal, right slightly better, right better} as {0, 1, 2, 3, 4}). This probability is determined by the mean and variance of the pair's score difference: it is the area under the normal distribution of the score difference between two consecutive boundaries, with the first and last labels taking the two tails. Let $n_i^j$ indicate the number of judges labeling the $i$-th pair as label $j$. Thus, we define a log maximum likelihood cost function as
$$\mathrm{cost} = -\log \prod_{i=1}^{N} \prod_{j=0}^{4} \left(p_i^j\right)^{n_i^j} = -\sum_{i=1}^{N} \sum_{j=0}^{4} n_i^j \log\left(p_i^j\right)$$
where N is the number of pairs.
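The cost function above can be evaluated with the normal CDF. This is a minimal sketch under stated assumptions: the score difference is taken as the right image's rating minus the left image's, so that label 0 ('left better') falls in the lowest region; the boundaries are sorted in ascending order; and a small `eps` is added inside the logarithm for numerical safety. None of these details are specified in this text, and the function names are illustrative.

```python
import math

def normal_cdf(x, mean, std):
    """CDF of a normal distribution, with the infinite tails handled explicitly."""
    if x == -math.inf:
        return 0.0
    if x == math.inf:
        return 1.0
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def label_probabilities(mean_diff, std_diff, boundaries):
    """Area of N(mean_diff, std_diff^2) in each of the 5 regions cut by 4 boundaries."""
    edges = [-math.inf] + list(boundaries) + [math.inf]
    cdf = [normal_cdf(edge, mean_diff, std_diff) for edge in edges]
    return [cdf[k + 1] - cdf[k] for k in range(len(edges) - 1)]

def pairwise_cost(pairs, boundaries, eps=1e-12):
    """Negative log-likelihood of the judge label counts.

    pairs: list of (mean_diff, std_diff, counts), where counts[j] is the
    number of judges who chose label j for that pair.
    """
    cost = 0.0
    for mean_diff, std_diff, counts in pairs:
        probs = label_probabilities(mean_diff, std_diff, boundaries)
        cost -= sum(n * math.log(p + eps) for n, p in zip(counts, probs))
    return cost
```

In the paper the boundaries, the variance, and the rating parameters are learned jointly by backpropagation; this sketch only evaluates the cost for given values.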
We use backpropagation to jointly learn the decision boundaries, the variance, and the parameters of the hybrid rating in section 3.4. Each image document then gets an overall rating computed from its labels using this learned rating. The new rating replaces the original relevance rating in the NDCG of section 3.1 and is used during image ranker training.
4 Online results of utilizing watermark signal
During ranker training, each document's rating is computed using the learned hybrid rating. The image ranker takes a list of documents, each associated with a new rating and a list of features. The LambdaMART algorithm is used to train the image ranker. We trained two image rankers. The control ranker uses the original features during training. For the experimental ranker, we add the watermark signals obtained from the deep CNN model and the domain analysis to the existing feature pool. If the image's domain does not belong to the watermark domain list, we use the watermark probability predicted by the ResNet50 model. Otherwise, the watermark probability is 1. Table 5 shows that the experimental ranker's watermark rate is reduced from 5.2% to 4.7%, a relative reduction of about 10%, after adding the domain-based watermark feature. The watermark rate is further decreased to 3.7%, a relative reduction of about 20%, after adding the DNN-based watermark feature. The NDCG is also improved. Figure 9 shows examples where watermarked images are demoted for a few queries. For example, for the query 'New York night scene' in the first row, the control ranker surfaced 5 watermarked images (highlighted by red bounding boxes), while the new ranker shows no watermarked images.
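The way the two watermark signals are merged into one ranker feature, and the relative reductions quoted above, can be sketched as follows. The function names are illustrative, and the domain set would come from the known watermark domain list of section 2.2.

```python
def watermark_feature(domain, cnn_probability, watermark_domains):
    """Watermark probability fed to the ranker for one image.

    Images from a listed domain get probability 1; all others keep the
    probability predicted by the CNN classifier.
    """
    return 1.0 if domain in watermark_domains else cnn_probability

def relative_reduction(before, after):
    """Relative drop of a rate, e.g. the watermark rate in Table 5."""
    return (before - after) / before
```

Applied to Table 5, `relative_reduction(5.2, 4.7)` gives roughly 0.1, matching the quoted relative reduction of about 10% for the domain-based signal.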
5 Discussion
We proposed a few techniques to obtain watermark signals from online images and demonstrated the effectiveness of using them in image search. We designed a hybrid metric for the image ranker that enables it to pick up watermark-related features. This points toward better image search quality for users through effective demotion of watermarked images. More research can be done to understand which image attributes are mainly responsible for watermark detection and which parts of the neural network are sensitive to watermark information.
6 Acknowledgement
We thank Rui Xia, Viktor Burdeinyi, Yiran Shen, Houdong Hu and Arun Sacheti for valuable help.
References
- [1] C. Burges, R. Ragno, and Q. Le. Learning to rank with non-smooth cost functions. In Conference on Neural Information Processing Systems (NIPS), 2006.
- [2] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In International Conference on Machine Learning (ICML), 2005.
- [3] M. Dashti, R. Safabakhsh, M. Pourfard, and M. Abdollahifard. Video logo removal using iterative subsequent matching. In The International Symposium on Artificial Intelligence and Signal Processing (AISP), 2015.
- [4] T. Dekel, M. Rubinstein, C. Liu, and W. Freeman. On the effectiveness of visible watermark. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
- [6] J. Friedman. Greedy function approximation: A gradient boosting machine. Technical report, Dept. Statistics, Stanford, 1999.
- [7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [8] G. Huang, Z. Liu, L. Maaten, and K. Weinberger. Densely connected convolutional networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
